Abstract. We consider the problem of computing the value and an optimal strategy
for minimizing the expected termination time in one-counter Markov decision
processes. Since the value may be irrational and an optimal strategy may be
rather complicated, we concentrate on the problems of approximating the value
up to a given error ε > 0 and computing a finite representation of an ε-optimal
strategy. We show that these problems are solvable in exponential time for a given
configuration, and we also show that they are computationally hard in the sense
that a polynomial-time approximation algorithm cannot exist unless P=NP.



1 Introduction 

In recent years, a lot of research work has been devoted to the study of stochastic
extensions of various automata-theoretic models such as pushdown automata, Petri nets,
lossy channel systems, and many others. In this paper we study the class of one-counter
Markov decision processes (OC-MDPs), which are infinite-state MDPs [21, 15] generated
by finite-state automata operating over a single unbounded counter. Intuitively, an
OC-MDP is specified by a finite directed graph 𝒜 where the nodes are control states
and the edges correspond to transitions between control states. Each control state is
either stochastic or non-deterministic, which means that the next edge is chosen either
randomly (according to a fixed probability distribution over the outgoing edges) or by
a controller. Further, each edge either increments, decrements, or leaves unchanged the
current counter value. A configuration q(i) of an OC-MDP 𝒜 is given by the current
control state q and the current counter value i (for technical convenience, we also allow
negative counter values, although we are only interested in runs where the counter stays
non-negative). The outgoing transitions of q(i) are determined by the edges of 𝒜 in the
natural way.

Previous works on OC-MDPs [5, 3, 4] considered mainly the objective of
maximizing/minimizing termination probability. We say that a run initiated in a configuration
q(i) terminates if it visits a configuration with zero counter. The goal of the controller
is to play so that the probability of all terminating runs is maximized (or minimized).



* Tomáš Brázdil and Petr Novotný are supported by the Czech Science Foundation, grant
No. P202/12/G061. Antonín Kučera is supported by the Czech Science Foundation, grant
No. P202/10/1469. Dominik Wojtczak is supported by EPSRC grant EP/G050112/2.



In this paper, we study a related objective of minimizing the expected termination time.
Formally, we define a random variable T over the runs of 𝒜 such that T(ω) is equal
either to ∞ (if the run ω is non-terminating) or to the number of transitions needed to reach
a configuration with zero counter (if ω is terminating). The goal of the controller is to
minimize the expectation E(T). The value of q(i) is the infimum of E(T) over all
strategies. It is easy to see that the controller has a memoryless deterministic strategy which is
optimal (i.e., achieves the value) in every configuration. However, since OC-MDPs have
infinitely many configurations, this does not imply that an optimal strategy is finitely
representable and computable. Further, the value itself can be irrational. Therefore, we
concentrate on the problem of approximating the value of a given configuration up to
a given (absolute or relative) error ε > 0, and computing a strategy which is ε-optimal
(in both absolute and relative sense). Our main results can be summarized as follows:

- The value and optimal strategy can be effectively approximated up to a given
relative/absolute error in exponential time. More precisely, we show that given
an OC-MDP 𝒜, a configuration q(i) of 𝒜 where i > 0, and ε > 0, the value of
q(i) up to the (relative or absolute) error ε is computable in time exponential in
the encoding size of 𝒜, i, and ε, where all numerical constants are represented
as fractions of binary numbers. Further, there is a history-dependent deterministic
strategy σ computable in exponential time such that the absolute/relative difference
between the value of q(i) and the outcome of σ in q(i) is bounded by ε.

- The value is not approximable in polynomial time unless P=NP. This hardness
result holds even if we restrict ourselves to configurations with counter value equal
to 1 and to OC-MDPs where every outgoing edge of a stochastic control state has
probability 1/2. The result is valid for absolute as well as relative approximation.

Let us sketch the basic ideas behind these results. The upper bounds are obtained in two
steps. In the first step (Section 3.1), we analyze the special case when the underlying
graph of 𝒜 is strongly connected. We show that minimizing the expected termination
time is closely related to minimizing the expected increase of the counter per transition,
at least for large counter values. We start by computing the minimal expected increase
of the counter per transition (denoted by x) achievable by the controller, and the
associated strategy σ. This is done by standard linear programming techniques developed
for optimizing the long-run average reward in finite-state MDPs (see, e.g., [21]) applied
to the underlying finite graph of 𝒜. Note that σ depends only on the current control
state and ignores the current counter value (we say that σ is counterless). Further, the
encoding size of x is polynomial in ||𝒜||. Then, we distinguish two cases.

Case (A), x ≥ 0. Then the counter does not have a tendency to decrease regardless
of the controller's strategy, and the expected termination time value is infinite in all
configurations q(i) such that i ≥ |Q|, where Q is the set of control states of 𝒜 (see
Proposition 5.A). For the finitely many remaining configurations, we can compute the
value and optimal strategy precisely by standard methods for finite-state MDPs.

Case (B), x < 0. Then, one intuitively expects that applying the strategy σ in an
initial configuration q(i) yields the expected termination time about i/|x|. Actually, this
is almost correct; we show (Proposition 5.B.2) that this expectation is bounded by
(i + U)/|x|, where U ≥ 0 is a constant depending only on 𝒜 whose size is at most
exponential in ||𝒜||. Further, we show that an arbitrary strategy π applied to q(i) yields
the expected termination time at least (i - V)/|x|, where V ≥ 0 is a constant depending
only on 𝒜 whose size is at most exponential in ||𝒜|| (Proposition 5.B.1). In particular,
this applies to the optimal strategy π* for minimizing the expected termination time.
Hence, π* can be more efficient than σ, but the difference between their outcomes is
bounded by a constant which depends only on 𝒜 and is at most exponential in ||𝒜||.
We proceed by computing a sufficiently large k so that the probability of increasing the
counter to i + k by a run initiated in q(i) is inevitably (i.e., under any optimal strategy)
so small that the controller can safely switch to the strategy σ when the counter reaches
the value i + k. Then, we construct a finite-state MDP M and a reward function f over
its transitions such that

- the states are all configurations p(j) where 0 ≤ j ≤ i + k;

- all states with counter values less than i + k "inherit" their transitions from 𝒜;
configurations of the form p(i + k) have only self-loops;

- the self-loops on configurations where the counter equals 0 or i + k have zero reward,
transitions leading to configurations where the counter equals i + k have reward
(i + k + U)/|x|, and the other transitions have reward 1.

In this finite-state MDP M, we compute an optimal memoryless deterministic strategy
ϱ for the total accumulated reward objective specified by f. Then, we consider another
strategy σ̂ for q(i) which behaves like ϱ until the point when the counter reaches i + k,
and from that point on it behaves like σ. It turns out that the absolute as well as relative
difference between the outcome of σ̂ in q(i) and the value of q(i) is bounded by ε, and
hence σ̂ is the desired ε-optimal strategy.

In the general case when 𝒜 is not necessarily strongly connected (see Section 3.2),
we have to solve additional difficulties. Intuitively, we split the graph of 𝒜 into
maximal end components (MECs), where each MEC can be seen as a strongly
connected OC-MDP and analyzed by the techniques discussed above. In particular, for
every MEC C we compute the associated x_C (see above). Then, we consider a strategy
which tries to reach a MEC as quickly as possible so that the expected value of the
fraction 1/|x_C| is minimal. After reaching a target MEC, the strategy starts to behave as
the strategy σ discussed above. It turns out that this particular strategy cannot be much
worse than the optimal strategy (a proof of this claim requires new observations), and
the rest of the argument is similar to the strongly connected case.

The lower bound, i.e., the result saying that the value cannot be efficiently
approximated unless P=NP (see Section 4), seems to be the first result of this kind for
OC-MDPs. Here we combine the technique of encoding propositional assignments
presented in [19] (see also [17]) with some new gadgets constructed specifically for this
proof (let us note that we did not manage to improve the presented lower bound to
PSPACE by adapting other known techniques [16, 22, 18]). As a byproduct, our proof
also reveals that the optimal strategy for minimizing the expected termination time
cannot ignore the precise counter value, even if the counter becomes very large. In our
example, the (only) optimal strategy is eventually periodic in the sense that for a
sufficiently large counter value i, it is only "i modulo c" which matters, where c is a fixed
(exponentially large) constant. The question whether there always exists an optimal
eventually periodic strategy is left open. Another open question is whether our results
can be extended to stochastic games over one-counter automata.

Related work: One-counter automata can also be seen as pushdown automata with a
one-letter stack alphabet. Stochastic games and MDPs generated by pushdown automata
and stateless pushdown automata (also known as BPA) with termination and
reachability objectives have been studied in [13, 14, 6, 7]. To the best of our knowledge, the
only prior work on the expected termination time (or, more generally, total accumulated
reward) objective for a class of infinite-state MDPs or stochastic games is [11], where
this problem is studied for stochastic BPA games. The termination objective for
one-counter MDPs and games has been examined in [5, 3, 4], where it was shown (among
other things) that the equilibrium termination probability (i.e., the termination value)
can be approximated up to a given precision in exponential time, but no lower bound
was provided. The games over one-counter automata are also known as "energy games"
[9, 10]. Intuitively, the counter is used to model the amount of currently available
energy, and the aim of the controller is to optimize the energy consumption. Finally,
let us note that OC-MDPs can be seen as discrete-time Quasi-Birth-Death Processes
(QBDs, see, e.g., [20, 12]) extended with a control. Hence, the theory of one-counter
MDPs and games is closely related to queuing theory, where QBDs are considered as a
fundamental model.

2 Preliminaries 

Given a set A, we use |A| to denote the cardinality of A. We also write |x| to denote the
absolute value of a given x ∈ ℝ, but this should not cause any confusion. The encoding
size of a given object B is denoted by ||B||. The set of integers is denoted by ℤ, and the
set of positive integers by ℕ.

We assume familiarity with basic notions of probability theory. In particular, we call
a probability distribution f over a discrete set A positive if f(a) > 0 for all a ∈ A, and
Dirac if f(a) = 1 for some a ∈ A.

Definition 1 (MDP). A Markov decision process (MDP) is a tuple M =
(S, (S_0, S_1), ↪, Prob), consisting of a countable set of states S partitioned into the
sets S_0 and S_1 of stochastic and non-deterministic states, respectively. The edge
relation ↪ ⊆ S × S is total, i.e., for every r ∈ S there is s ∈ S such that r ↪ s. Finally,
Prob assigns to every s ∈ S_0 a positive probability distribution over its outgoing edges.

A finite path is a sequence w = s_0 s_1 ⋯ s_n of states such that s_i ↪ s_{i+1} for all 0 ≤ i < n.
We write len(w) = n for the length of the path. A run is an infinite sequence ω of states
such that every finite prefix of ω is a path. For a finite path w, we denote by Run(w)
the set of runs having w as a prefix. These generate the standard σ-algebra on the set of
runs.

Definition 2 (OC-MDP). A one-counter MDP (OC-MDP) is a tuple 𝒜 =
(Q, (Q_0, Q_1), δ, P), where Q is a finite non-empty set of control states partitioned into
stochastic and non-deterministic states (as in the case of MDPs), δ ⊆ Q × {+1, 0, -1} × Q
is a set of transition rules such that δ(q) := {(q, i, r) ∈ δ} ≠ ∅ for all q ∈ Q, and
P = {P_q}_{q ∈ Q_0} where P_q is a positive rational probability distribution over δ(q) for all
q ∈ Q_0.
In the rest of this paper we often write q →^{i} r to indicate that (q, i, r) ∈ δ, and q →^{i,x} r
to indicate that (q, i, r) ∈ δ, q is stochastic, and P_q(q, i, r) = x. Without restrictions, we
assume that for each pair q, r ∈ Q there is at most one i such that (q, i, r) ∈ δ. The
encoding size of 𝒜 is denoted by ||𝒜||, where all numerical constants are encoded as
fractions of binary numbers. The set of all configurations is C := {q(i) | q ∈ Q, i ∈ ℤ}.

To 𝒜 we associate an infinite-state MDP M^∞_𝒜 = (C, (C_0, C_1), ⇝, Prob), where the
partition of C is defined by q(i) ∈ C_0 iff q ∈ Q_0, and similarly for C_1. The edges are
defined by q(i) ⇝ r(j) iff (q, j - i, r) ∈ δ. The probability assignment Prob is derived
naturally from P.

By forgetting the counter values, the OC-MDP 𝒜 also defines a finite-state MDP
M_𝒜 = (Q, (Q_0, Q_1), ↪, Prob'). Here q ↪ r iff (q, i, r) ∈ δ for some i, and Prob' is
derived in the obvious way from P by forgetting the counter changes.
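To make the two associated MDPs concrete, the following is a minimal sketch under an assumed encoding that is not taken from the paper: a rule is a tuple (q, change, r, prob), with prob set to None for rules of non-deterministic states. The names RULES, successors and forget_counter, as well as the two-state example, are hypothetical.

```python
from fractions import Fraction

# Hypothetical toy OC-MDP: "s" is stochastic, "c" is controlled.
# Each rule is (source state, counter change, target state, probability).
RULES = [
    ("s", -1, "s", Fraction(3, 4)),
    ("s", +1, "c", Fraction(1, 4)),
    ("c", 0, "s", None),   # None: the edge is chosen by the controller
    ("c", -1, "c", None),
]


def successors(config, rules=RULES):
    """Edges q(i) ~> r(j) of the infinite-state MDP: (q, j - i, r) is a rule."""
    q, i = config
    return [((r, i + change), prob) for (p, change, r, prob) in rules if p == q]


def forget_counter(rules=RULES):
    """Edges of the finite-state MDP obtained by dropping the counter changes."""
    return {(q, r) for (q, _change, r, _prob) in rules}
```

For instance, successors(("s", 2)) lists the configurations s(1) and c(3) together with their probabilities, while forget_counter() returns only pairs of control states.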

Strategies and Probability. Let M be an MDP. A history is a finite path in M, and
a strategy (or policy) is a function assigning to each history ending in a state from S_1
a distribution on edges leaving the last state of the history. A strategy σ is pure (or
deterministic) if it always assigns 1 to one edge and 0 to the others, and memoryless if
σ(w) = σ(s) where s is the last state of a history w.

Now consider some OC-MDP 𝒜. A strategy σ over the histories in M^∞_𝒜 is counterless
if it is memoryless and σ(q(i)) = σ(q(j)) for all i, j. Observe that every strategy σ
for M_𝒜 gives a unique strategy σ' for M^∞_𝒜 which just forgets the counter values in the
history and plays as σ. This correspondence is bijective when restricted to memoryless
strategies in M_𝒜 and counterless strategies in M^∞_𝒜, and it is used implicitly throughout
the paper.

Fixing a strategy σ and an initial state s, we obtain in a standard way a probability
measure P^σ_s(·) on the subspace of runs starting in s. For MDPs of the form M^∞_𝒜
for some OC-MDP 𝒜, we consider two sequences of random variables, {C^(i)}_{i≥0} and
{S^(i)}_{i≥0}, returning the current counter value and the current control state after
completing i transitions.

Termination Time in OC-MDPs. Let 𝒜 be an OC-MDP. A run ω in M^∞_𝒜 terminates
if ω(j) = q(0) for some j ≥ 0 and q ∈ Q. The associated termination time, denoted
by T(ω), is the least j such that ω(j) = q(0) for some q ∈ Q. If there is no such j, we
put T(ω) = ∞, where the symbol ∞ denotes the "infinite amount" with the standard
conventions, i.e., c < ∞ and ∞ + c = ∞ + ∞ = ∞ · d = ∞ for arbitrary real numbers c, d
where d > 0.
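As a small illustration of this definition, the sketch below evaluates the termination time on a finite prefix of a run. Representing configurations as (state, counter) pairs and the function name termination_time are assumptions made only for this example; on a prefix, the result math.inf merely means that no configuration with zero counter occurs within the prefix.

```python
import math


def termination_time(prefix):
    """Least index j whose configuration has counter 0, or math.inf if none."""
    for j, (_state, counter) in enumerate(prefix):
        if counter == 0:
            return j
    return math.inf
```

Later visits to zero or negative counter values are ignored, since only the first index with counter 0 matters.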

For every strategy σ and a configuration q(i), we use E^σ q(i) to denote the
expected value of T in the probability space of all runs initiated in q(i) where P^σ_{q(i)}(·)
is the underlying probability measure. The value of a given configuration q(i) is
defined by Val(q(i)) := inf_σ E^σ q(i). Let ε > 0 and i ≥ 1. We say that a constant v
approximates Val(q(i)) up to the absolute or relative error ε if |Val(q(i)) - v| ≤ ε or
|Val(q(i)) - v|/Val(q(i)) ≤ ε, respectively. Note that if v approximates Val(q(i)) up to the
absolute error ε, then it also approximates Val(q(i)) up to the relative error ε because
Val(q(i)) ≥ 1. A strategy σ is (absolutely or relatively) ε-optimal if E^σ q(i) approximates
Val(q(i)) up to the (absolute or relative) error ε. A 0-optimal strategy is called optimal.






maximize x, subject to

$$z_q \le -x + k + z_r \qquad \text{for all } q \in Q_1 \text{ and } (q,k,r) \in \delta,$$

$$z_q \le -x + \sum_{(q,k,r) \in \delta} P_q((q,k,r)) \cdot (k + z_r) \qquad \text{for all } q \in Q_0.$$

Fig. 1. The linear program ℒ over x and z_q, q ∈ Q.
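The constraints of Figure 1 can be checked mechanically for a candidate assignment. The sketch below is an illustration only: it verifies feasibility of a given pair (x, z) and does not solve the program. The rule encoding (q, k, r, prob), the function name is_feasible and the one-state example are assumptions, not part of the paper.

```python
from fractions import Fraction


def is_feasible(x, z, rules, stochastic):
    """Check the constraints of the linear program for a candidate (x, z).

    rules: list of (q, k, r, prob); prob is ignored for controlled states.
    stochastic: set of stochastic control states (Q_0 in the text).
    """
    for q in {rule[0] for rule in rules}:
        outgoing = [rule for rule in rules if rule[0] == q]
        if q in stochastic:
            # z_q <= -x + sum of P_q(rule) * (k + z_r) over all rules of q
            expected = sum(prob * (k + z[r]) for (_q, k, r, prob) in outgoing)
            if z[q] > -x + expected:
                return False
        elif any(z[q] > -x + k + z[r] for (_q, k, r, _prob) in outgoing):
            # z_q <= -x + k + z_r must hold for every rule of a controlled state
            return False
    return True


# Hypothetical example: one stochastic state that decrements with probability 3/4.
EXAMPLE_RULES = [("s", -1, "s", Fraction(3, 4)), ("s", +1, "s", Fraction(1, 4))]
```

In this example the single constraint reduces to x ≤ -1/2, so x = -1/2 is feasible and matches the expected counter change per transition, while x = -1/4 is rejected.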



It is easy to see that there is a memoryless deterministic strategy σ in M^∞_𝒜 which is
optimal in every configuration of M^∞_𝒜. First, observe that for all q ∈ Q_0, q' ∈ Q_1, and
i ≠ 0 we have that

$$\mathrm{Val}(q(i)) = 1 + \sum_{q(i) \overset{x}{\leadsto} r(j)} x \cdot \mathrm{Val}(r(j)),$$

$$\mathrm{Val}(q'(i)) = 1 + \min\{\mathrm{Val}(r(j)) \mid q'(i) \leadsto r(j)\},$$

where the sum ranges over the edges leaving q(i) and x denotes the probability of the
respective edge. We put σ(q'(i)) = r(j) where q'(i) ⇝ r(j) and Val(q'(i)) = 1 + Val(r(j)) (if
there are several candidates for r(j), any of them can be chosen). Now we can easily verify
that σ is indeed optimal in every configuration.

3 Upper Bounds

The goal of this section is to prove the following:

Theorem 3. Let 𝒜 be an OC-MDP, q(i) a configuration of 𝒜 where i > 0, and ε > 0.

1. The problem whether Val(q(i)) = ∞ is decidable in polynomial time.

2. There is an algorithm that computes a rational number v such that
|Val(q(i)) - v| ≤ ε, and a strategy σ that is absolutely ε-optimal starting in q(i).
The algorithm runs in time exponential in ||𝒜|| and polynomial in i and 1/ε. (Note
that v then approximates Val(q(i)) also up to the relative error ε, and σ is also
relatively ε-optimal in q(i)).

For the rest of this section, we fix an OC-MDP 𝒜 = (Q, (Q_0, Q_1), δ, P). First, we prove
Theorem 3 under the assumption that M_𝒜 is strongly connected (Section 3.1). A
generalization to arbitrary OC-MDPs is then given in Section 3.2.

3.1 Strongly connected OC-MDP

Let us assume that M_𝒜 is strongly connected, i.e., for all p, q ∈ Q there is a finite path
from p to q in M_𝒜. Consider the linear program of Figure 1. Intuitively, the variable
x encodes a lower bound on the long-run trend of the counter value. More precisely,
the maximal value of x corresponds to the minimal long-run average change in the
counter value achievable by some strategy. The program corresponds to the one used
for optimizing the long-run average reward in Sections 8.8 and 9.5 of [21], and hence
we know it has a solution.

Lemma 4 ([21]). There is a rational solution (x, (z_q)_{q∈Q}) ∈ ℚ^{|Q|+1} to ℒ, and the
encoding size³ of the solution is polynomial in ||𝒜||.

³ Recall that rational numbers are represented as fractions of binary numbers.

Note that x ≥ -1, because for any fixed x ≤ -1 the program ℒ trivially has a feasible
solution. Further, we put V := max_{q∈Q} z_q - min_{q∈Q} z_q. Observe that V ∈ exp(||𝒜||^{O(1)})
and V is computable in time polynomial in ||𝒜||.

Proposition 5. Let (x, (z_q)_{q∈Q}) be a solution of ℒ.

(A) If x ≥ 0, then Val(q(i)) = ∞ for all q ∈ Q and i ≥ |Q|.

(B) If x < 0, then the following holds:

(B.1) For every strategy π and all q ∈ Q, i ≥ 0 we have that E^π q(i) ≥ (i - V)/|x|.
(B.2) There is a counterless strategy σ and a number U ∈ exp(||𝒜||^{O(1)}) such that
for all q ∈ Q, i ≥ 0 we have that E^σ q(i) ≤ (i + U)/|x|. Moreover, σ and U are
computable in time polynomial in ||𝒜||.

First, let us realize that Proposition 5 implies Theorem 3. To see this, we consider
the cases x ≥ 0 and x < 0 separately. In both cases, we resort to analyzing a
finite-state MDP 𝒢_K, where K is a suitable natural number, obtained by restricting M^∞_𝒜 to
configurations with counter value at most K, and by substituting all transitions leaving
each p(K) with a self-loop of the form p(K) ⇝ p(K).

First, let us assume that x ≥ 0. By Proposition 5 (A), we have that Val(q(i)) = ∞ for
all q ∈ Q and i ≥ |Q|. Hence, it remains to approximate the value and compute an ε-optimal
strategy for all configurations q(i) where i < |Q|. Actually, we can even compute these
values precisely and construct a strategy σ which is optimal in each such q(i). This is
achieved simply by considering the finite-state MDP 𝒢_{|Q|} and solving the objective of
minimizing the expected number of transitions needed to reach a state of the form p(0),
which can be done by standard methods in time polynomial in ||𝒜||.

If x < 0, we argue as follows. The strategy σ of Proposition 5 (B.2) is not
necessarily ε-optimal in q(i), so we cannot use it directly. To overcome this problem,
consider an optimal strategy π* in q(i), and let X_ℓ be the probability that a run initiated
in q(i) (under the strategy π*) visits a configuration of the form r(i + ℓ). Obviously,
X_ℓ · min_{r∈Q} {E^{π*} r(i + ℓ)} ≤ E^{π*} q(i), because otherwise π* would not be optimal in q(i).
Using the lower/upper bounds for E^{π*} r(i + ℓ) and E^σ q(i) given in Proposition 5 (B), we
obtain X_ℓ ≤ (i + U)/(i + ℓ - V). Then, we compute k ∈ ℕ such that

$$X_k \cdot \max_{r \in Q} \left\{ \frac{i + k + U}{|x|} - \mathbb{E}^{\pi^*} r(i+k) \right\} \le \varepsilon.$$

A simple computation reveals that it suffices to put

$$k \ge \frac{(i + U)(U + V)}{\varepsilon |x|} + V - i.$$
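The bound on k follows by combining X_k ≤ (i + U)/(i + k - V) with the fact that the maximum above is at most (U + V)/|x| by Proposition 5 (B.1). As a hedged sketch, the smallest admissible natural number can be computed with exact rational arithmetic; the function name truncation_height and the sample constants are hypothetical and serve only as an illustration.

```python
from fractions import Fraction
from math import ceil


def truncation_height(i, U, V, x, eps):
    """Smallest natural k with k >= (i + U)(U + V) / (eps * |x|) + V - i."""
    bound = Fraction(i + U) * (U + V) / (eps * abs(x)) + V - i
    # k ranges over positive integers, so never return less than 1.
    return max(1, ceil(bound))
```

For example, with i = 2, U = V = 1, x = -1/2 and ε = 1/10 the bound evaluates to 120 + 1 - 2 = 119. Since 1/|x| and U can be exponential in the size of the OC-MDP, k can be exponentially large as well.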

Now, consider 𝒢_{i+k}, and let f be a reward function over the transitions of 𝒢_{i+k} such
that the loops on configurations where the counter equals 0 or i + k have zero reward, a
transition leading to a state r(i + k) has reward (i + k + U)/|x|, and all of the remaining
transitions have reward 1. Now we solve the finite-state MDP 𝒢_{i+k} with the objective
of minimizing the total accumulated reward. Note that an optimal strategy ϱ in 𝒢_{i+k} is
computable in time polynomial in the size of 𝒢_{i+k} [21]. Then, we define the
corresponding strategy σ̂ in M^∞_𝒜, which behaves like ϱ until the counter reaches i + k, and from
that point on it behaves like the counterless strategy σ. It is easy to see that σ̂ is indeed
ε-optimal in q(i).
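The truncated MDP and its reward function can be sketched as follows. This is an illustration under stated assumptions, not the procedure of [21]: it uses plain value iteration started from zero instead of linear programming, treats counter values 0 and top as absorbing, and encodes rules as (q, change, r, prob) with prob ignored for controlled states. The names solve_truncated, top and boundary_reward are hypothetical; boundary_reward plays the role of (i + k + U)/|x|.

```python
from fractions import Fraction


def solve_truncated(rules, stochastic, top, boundary_reward, sweeps=1000):
    """Value iteration for the minimal total reward in the MDP truncated at `top`.

    Configurations are pairs (q, j) with 0 <= j <= top. Counter values 0 and
    `top` are absorbing with reward 0, entering counter value `top` costs
    `boundary_reward`, and every other transition costs 1.
    """
    states = {rule[0] for rule in rules} | {rule[2] for rule in rules}
    value = {(q, j): Fraction(0) for q in states for j in range(top + 1)}
    for _ in range(sweeps):
        new = dict(value)
        for q in states:
            for j in range(1, top):
                costs = []
                for (p, change, r, prob) in rules:
                    if p != q:
                        continue
                    reward = boundary_reward if j + change == top else 1
                    costs.append((prob, reward + value[(r, j + change)]))
                if q in stochastic:
                    new[(q, j)] = sum(prob * cost for prob, cost in costs)
                else:
                    new[(q, j)] = min(cost for _prob, cost in costs)
        if new == value:  # exact fixed point reached
            break
        value = new
    return value


# Hypothetical example: one controlled state that may decrement or increment.
TOY_RULES = [("q", -1, "q", None), ("q", +1, "q", None)]
```

In the toy example with top = 4 and boundary_reward = 4, the controller always decrements, so the value of q(j) is j. When stochastic cycles are present the iteration may stop after `sweeps` rounds without reaching an exact fixed point, in which case the returned numbers are only lower approximations.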

Proof of Proposition 5. Similarly as in [4], we use the solution (x, (z_q)_{q∈Q}) ∈ ℚ^{|Q|+1} of
ℒ to define a suitable submartingale, which is then used to derive the required bounds.
In [4], Azuma's inequality was applied to the submartingale to prove exponential tail
bounds for termination probability. In this paper, we need to use the optional stopping
theorem rather than Azuma's inequality, and therefore we need to define the
submartingale relative to a suitable filtration so that we can introduce an appropriate stopping
time (without the filtration, the stopping time would have to depend just on numerical
values returned by the martingale, which does not suit our purposes).

Recall the random variables {C^(i)}_{i≥0} and {S^(i)}_{i≥0} returning the height of the counter
and the control state after completing i transitions, respectively. Given the solution
(x, (z_q)_{q∈Q}) ∈ ℚ^{|Q|+1} from Lemma 4, we define a sequence of random variables {m^(i)}_{i≥0}
by setting

$$m^{(i)} := \begin{cases} C^{(i)} + z_{S^{(i)}} - i \cdot x & \text{if } C^{(j)} > 0 \text{ for all } 0 \le j < i, \\ m^{(i-1)} & \text{otherwise.} \end{cases}$$

Note that for every history u of length i and every 0 ≤ j ≤ i, the random variable m^(j)
returns the same value for every ω ∈ Run(u). The same holds for variables S^(j) and C^(j).
We will denote these common values m^(j)(u), S^(j)(u) and C^(j)(u), respectively. Using the
same arguments as in Lemma 3 of [4], one may show that for every history u of length
i we have E(m^(i+1) | Run(u)) ≥ m^(i)(u). This shows that {m^(i)}_{i≥0} is a submartingale
relative to the filtration {F_i}_{i≥0}, where for each i ≥ 0 the σ-algebra F_i is the σ-algebra
generated by all Run(u) where len(u) = i. Intuitively, this means that the value m^(i)(ω)
is uniquely determined by the prefix of ω of length i and that the process {m^(i)}_{i≥0} has
nonnegative average change. For relevant definitions of (sub)martingales see, e.g., [23].
Another important observation is that |m^(i+1) - m^(i)| ≤ 1 + |x| + V for every i ≥ 0, i.e., the
differences of the submartingale are bounded.

Lemma 6. Under an arbitrary strategy π and with an arbitrary initial configuration
q(j) where j ≥ 0, the process {m^(i)}_{i≥0} is a submartingale (relative to the filtration
{F_i}_{i≥0}) with bounded differences.

Part (A) of Proposition 5. This part can be proved by a routine application of the
optional stopping theorem to the martingale {m^(i)}_{i≥0}. Let z_max := max_{q∈Q} z_q, and consider
a configuration p(ℓ) where ℓ + z_p ≥ z_max. Let σ be a strategy which is optimal in every
configuration. Assume, for the sake of contradiction, that Val(p(ℓ)) < ∞.

Let us fix k ∈ ℕ such that ℓ + z_p < z_max + k and define a stopping time τ which
returns the first point in time in which either m^(τ) ≥ z_max + k, or m^(τ) ≤ z_max. To apply
the optional stopping theorem, we need to show that the expectation of τ is finite.

We argue that every configuration q(i) with i ≥ 1 satisfies the following: under the
optimal strategy σ, a configuration with counter height i - 1 is reachable from q(i) in
at most |Q|^2 steps (i.e., with a probability bounded away from zero). To see this, realize
that for every configuration r(j) there is a successor, say r'(j'), such that
Val(r(j)) > Val(r'(j')). Now consider a run w initiated in q(i) obtained by subsequently
choosing successors with smaller and smaller values. Note that whenever w(j) and w(j')
with j < j' have the same control state, the counter height of w(j') must be strictly
smaller than the one of w(j) because otherwise the strategy σ could be improved (it
suffices to behave in w(j) as in w(j')). It follows that there must be k ≤ |Q|^2 such that
the counter height of w(k) is i - 1. From this we obtain that the expected value of τ is
finite because the probability of terminating from any configuration with bounded counter
height is bounded from zero.
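Now we apply the optional stopping theorem and obtain P^σ_{p(ℓ)}(m^(τ) ≥ z_max + k) ≥ c/(k + d)
for suitable constants c, d > 0. As m^(τ) ≥ z_max + k implies C^(τ) ≥ k, we obtain that

$$\mathcal{P}^{\sigma}_{p(\ell)}(T \ge k) \;\ge\; \mathcal{P}^{\sigma}_{p(\ell)}(C^{(\tau)} \ge k) \;\ge\; \mathcal{P}^{\sigma}_{p(\ell)}(m^{(\tau)} \ge z_{\max} + k) \;\ge\; \frac{c}{k+d}$$

and thus

$$\mathbb{E}^{\sigma} p(\ell) \;=\; \sum_{k=1}^{\infty} \mathcal{P}^{\sigma}_{p(\ell)}(T \ge k) \;\ge\; \sum_{k=1}^{\infty} \frac{c}{k+d} \;=\; \infty$$

which contradicts our assumption that σ is optimal and Val(p(ℓ)) < ∞.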

It remains to show that Val(p(ℓ)) = ∞ even for ℓ = |Q|. This follows from the
following simple observation:

Lemma 7. For all q ∈ Q and i ≥ |Q| we have that Val(q(i)) < ∞ iff Val(q(|Q|)) < ∞.

The "only if" direction of Lemma 7 is trivial. For the other direction, let B_k denote the
set of all p ∈ Q such that Val(p(k)) < ∞. Clearly, B_0 = Q, B_k ⊆ B_{k-1}, and one can
easily verify that B_k = B_{k+1} implies B_k = B_{k+ℓ} for all ℓ ≥ 0. Hence, B_{|Q|} = B_{|Q|+ℓ} for
all ℓ. Note that Lemma 7 holds for general OC-MDPs (i.e., we do not need to assume
that M_𝒜 is strongly connected).

Part (B.1) of Proposition 5. Let π be a strategy and q(i) a configuration where i ≥ 0.
If E^π q(i) = ∞, we are done. Now assume E^π q(i) < ∞. Observe that for every k ≥ 0
and every run ω, the membership of ω in {T ≤ k} depends only on the finite prefix of
ω of length k. This means that T is a stopping time relative to the filtration {F_n}_{n≥0}. Since
E^π q(i) < ∞ and the submartingale {m^(n)}_{n≥0} has bounded differences, we can apply the
optional stopping theorem and obtain E^π(m^(0)) ≤ E^π(m^(T)). But E^π(m^(0)) = i + z_q and
E^π(m^(T)) = E^π z_{S^(T)} + E^π q(i) · |x|. Thus, we get
E^π q(i) ≥ (i + z_q - E^π z_{S^(T)})/|x| ≥ (i - V)/|x|.

Part (B.2) of Proposition 5. First we show how to construct the desired strategy σ.
Recall again the linear program ℒ of Figure 1. We have already shown that this program
has an optimal solution (x, (z_q)_{q∈Q}) ∈ ℚ^{|Q|+1}, and we assume that x < 0. By the strong
duality theorem, this means that the linear program dual to ℒ also has a feasible solution
((y_q)_{q∈Q_0}, (y_{(q,i,q')})_{q∈Q_1, (q,i,q')∈δ}). Let

D = {q ∈ Q_0 | y_q > 0} ∪ {q ∈ Q_1 | y_{(q,i,q')} > 0 for some (q, i, q') ∈ δ}.

By Corollary 8.8.8 of [21], the solution ((y_q)_{q∈Q_0}, (y_{(q,i,q')})_{q∈Q_1, (q,i,q')∈δ}) can be chosen so
that for every q ∈ Q_1 there is at most one transition (q, i, q') with y_{(q,i,q')} > 0.
Following the construction given in Section 8.8 of [21], we define a counterless deterministic
strategy σ such that

- in a state q ∈ D ∩ Q_1, the strategy σ selects the transition (q, i, q') with y_{(q,i,q')} > 0;

- in the states outside D, the strategy σ behaves like an optimal strategy for the
objective of reaching the set D.

Clearly, the strategy σ is computable in time polynomial in ||𝒜||. To show that σ indeed
satisfies Part (B.2) of Proposition 5, we need to prove a series of auxiliary inequalities,
which can be found in Appendix A.1.

3.2 General OC-MDP 

In this section we prove Theorem 3 for general OC-MDPs, i.e., we drop the
assumption that M_𝒜 is strongly connected. We say that C ⊆ Q is an end component
of 𝒜 if C is strongly connected and for every p ∈ C ∩ Q_0 we have that
{q ∈ Q | p ↪ q} ⊆ C. A maximal end component (MEC) of 𝒜 is an end component
of 𝒜 which is maximal w.r.t. set inclusion. The set of all MECs of 𝒜 is
denoted by MEC(𝒜). Every C ∈ MEC(𝒜) determines a strongly connected OC-MDP
𝒜_C = (C, (C ∩ Q_0, C ∩ Q_1), δ ∩ (C × {+1, 0, -1} × C), {P_q}_{q ∈ C ∩ Q_0}). Hence, we may apply
Proposition 5 to 𝒜_C, and we use x_C and V_C to denote the constants of Proposition 5
computed for 𝒜_C.
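The MEC decomposition itself is standard. The following sketch uses the usual iterative refinement (split into strongly connected components, discard states that cannot stay inside, repeat) and is not taken from the paper; the graph encoding as a dictionary of successor sets and both function names are assumptions. The sketch additionally requires every state of an end component to have a successor inside it, so a single state without a self-loop is not reported.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm on a dict mapping each node to its successor set."""
    order, seen = [], set()
    for root in graph:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all successors explored: record the finishing order
                order.append(node)
                stack.pop()
    reverse = {node: set() for node in graph}
    for node, succs in graph.items():
        for nxt in succs:
            reverse[nxt].add(node)
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        component, todo = set(), [root]
        assigned.add(root)
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def maximal_end_components(edges, stochastic):
    """MECs of a finite MDP given by successor sets and its stochastic states."""
    candidates, result = [set(edges)], []
    while candidates:
        comp = candidates.pop()
        sub = {s: {t for t in edges[s] if t in comp} for s in comp}
        # A stochastic state must keep all its successors inside the candidate,
        # and every state needs at least one successor inside it.
        bad = {s for s in comp
               if (s in stochastic and sub[s] != set(edges[s])) or not sub[s]}
        if bad:
            if comp - bad:
                candidates.append(comp - bad)
            continue
        parts = strongly_connected_components(sub)
        if len(parts) == 1:
            result.append(frozenset(comp))
        else:
            candidates.extend(parts)
    return result
```

Each reported component C can then be turned into the strongly connected OC-MDP described above by keeping only the rules with both endpoints in C.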

Part 1. of Theorem 3. We show how to compute, in time polynomial in ||𝒜||, the set
Q_fin = {p ∈ Q | Val(p(k)) < ∞ for all k ≥ 0}. From this we easily obtain Part 1. of
Theorem 3, because for every configuration q(i) where i ≥ 0 we have the following:

- if i ≥ |Q|, then Val(q(i)) < ∞ iff q ∈ Q_fin (see Lemma 7);

- if i < |Q|, then Val(q(i)) < ∞ iff the set {p(0) | p ∈ Q} ∪ {p(|Q|) | p ∈ Q_fin}
can be reached from q(i) with probability 1 in the finite-state MDP 𝒢_{|Q|} defined in
Section 3.1 (here we again use Lemma 7).

So, it suffices to show how to compute the set Q_fin in polynomial time.

Proposition 8. Let Q_{<0} be the set of all states from which the set
H = {q ∈ Q | q belongs to a MEC C satisfying x_C < 0} is reachable with
probability 1. Then Q_fin = Q_{<0}. Moreover, the membership to Q_{<0} is decidable in time
polynomial in ||𝒜||.

Part 2. of Theorem 3. First, we generalize Part (B) of Proposition 5 into the following: 

Proposition 9. For every q e Qfi„ there is a number t q computable in time polynomial 
in ||j?l|| such that -I <t q <0, l/\t q \ e exp (ll^ll 00 '), and the following holds: 

(A) There is a counterless strategy cr and a number U € exp(||y[|P^) such that for 
every configuration q(i) where q £ Qfi„ and i > we have that Wq{i) < i/\t q \ + U. 
Moreover, both cr and U are computable in time polynomial in \\M\\. 

(B) There is a number L £ expOI^H^ 1 ') such that for every strategy n and every config- 
uration q(i) where i > \Q\ we have that W > i/\t q \ - L. Moreover, L is computable 
in time polynomial in 
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Once the Proposition 9 is proved, we can compute an e-optimal strategy for an arbitrary 
configuration q(i) where q e Qp n and ;' > \Q\ in exactly the same way (and with the 
same complexity) as in the strongly connected case. Actually, it can also be used to 
compute the approximate values and e-optimal strategies for configurations q(j) such 
that q t Qfi n or 1 < j <\Q\. Observe that 

- if $q \notin Q_{fin}$ and $j \geq |Q|$, the value is infinite by Part 1;

- otherwise, we construct the finite-state MDP $\mathcal{G}_{|Q|}$ (see Section 3.1) where the loops on configurations with counter value 0 have reward 0, the loops on configurations of the form $r(|Q|)$ have reward 0 or 1, depending on whether $r \in Q_{fin}$ or not, transitions leading to $r(|Q|)$ where $r \in Q_{fin}$ are rewarded with some $\varepsilon$-approximation of $\mathrm{Val}(r(|Q|))$, and all other transitions have reward 1. The reward function can be computed in time exponential in $\|\mathcal{A}\|$ by Proposition 9, and the minimal total accumulated reward from $q(j)$ in $\mathcal{G}_{|Q|}$, which can be computed by standard algorithms, is an $\varepsilon$-approximation of $\mathrm{Val}(q(j))$. The corresponding $\varepsilon$-optimal strategy can be computed in the obvious way.
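The "standard algorithms" for the minimal total accumulated reward are not spelled out in the text. One common choice is value iteration; the following is a minimal sketch under assumptions of our own: the dictionary encoding `trans`, the name `min_total_reward` and the fixed number of sweeps are hypothetical, and states without outgoing transitions stand in for the zero-reward loops.

```python
def min_total_reward(states, controlled, trans, sweeps=2000):
    """Approximate the minimal expected total reward by value iteration.

    trans[s] is a list of (prob, reward, successor) triples. For states in
    `controlled` the probability entry is ignored and the controller picks
    one triple; for all other states the triples form a distribution.
    States with no outgoing triples are treated as terminal (value 0).
    """
    value = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_value = {}
        for s in states:
            edges = trans.get(s, [])
            if not edges:
                new_value[s] = 0.0
            elif s in controlled:
                # Controller minimizes over the available transitions.
                new_value[s] = min(r + value[t] for _, r, t in edges)
            else:
                # Stochastic state: expectation over the outgoing transitions.
                new_value[s] = sum(p * (r + value[t]) for p, r, t in edges)
        value = new_value
    return value


# Toy instance (hypothetical): from 'a' the controller either pays 5 to stop
# immediately, or pays 1 to enter 'b', which stops with probability 1/2 per step.
example = min_total_reward(
    states=["a", "b", "goal"],
    controlled={"a"},
    trans={
        "a": [(None, 5.0, "goal"), (None, 1.0, "b")],
        "b": [(0.5, 1.0, "goal"), (0.5, 1.0, "b")],
    },
)
```

In the toy instance the expected cost from `b` is 2 (a geometric number of unit-cost steps), so the controller prefers the route through `b`. A fixed sweep count is used only for brevity; an implementation with guarantees would use an explicit stopping criterion or linear programming.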

The proofs of Propositions 8 and 9 can be found in Appendices A.2 and A.3, respectively.

4 Lower Bounds 

In this section, we show that approximating $\mathrm{Val}(q(i))$ is computationally hard, even if $i = 1$ and the edge probabilities in the underlying OC-MDP are all equal to 1/2. More precisely, we prove the following:

Theorem 10. The value of a given configuration $q(1)$ cannot be approximated in polynomial time up to a given absolute/relative error $\varepsilon > 0$ unless P=NP, even if all outgoing edges of all stochastic control states in the underlying OC-MDP have probability 1/2.

The proof of Theorem 10 is split into two phases, which are relatively independent. First, we show that given a propositional formula $\varphi$, one can efficiently compute an OC-MDP $\mathcal{A}$, a configuration $p(K)$ of $\mathcal{A}$, and a number $N$ such that the value of $p(K)$ is either $N - 1$ or $N$ depending on whether $\varphi$ is satisfiable or not, respectively. The numbers $K$ and $N$ are exponential in $\|\varphi\|$, which means that their encoding size is polynomial (we represent all numerical constants in binary). Here we use the technique of encoding propositional assignments into counter values presented in [19], but we also need to invent some specific gadgets to deal with our specific objective. The first part already implies that approximating $\mathrm{Val}(q(i))$ is computationally hard. In the second phase, we show that the same holds also for configurations where the counter is initiated to 1. This is achieved by employing another gadget which just increases the counter to an exponentially high value with a sufficiently large probability. The two phases are elaborated in Lemma 26 and Lemma 29, which can be found in Appendix A.4.

A Appendix



First, let us fix some additional notation that will be used throughout the whole appendix.

Given a random variable $X$ we denote by $\mathbb{E}^{\pi}_{q(i)} X$ the expected value of $X$ computed under strategy $\pi$ from initial configuration $q(i)$. Since the initial configuration will be fixed in most of the proofs, we will usually omit the subscript and write only $\mathbb{E}^{\pi} X$.

Also, given a random variable $X$ and an event $A$, we use $\mathbb{E}(X \mid A)$ to denote the conditional expectation of $X$ given the event $A$.

We also use $p_{\min}^{\mathcal{A}}$ to denote the minimal positive transition probability in $\mathcal{A}$. We will usually omit the superscript $\mathcal{A}$ if $\mathcal{A}$ is clear from the context.

We say that a (finite or infinite) path $u$ in an OC-MDP $\mathcal{A}$ hits or reaches a set $D \subseteq Q$ if $S^{(i)}(u) \in D$ for some $i < \mathit{len}(u)$. We say that $u$ evades $D$ if it does not hit $D$.

A.1 Proof of part (B.2) of Proposition 5.

First, denote by $\mathcal{A}^{\sigma}$ the finite one-counter Markov chain that results from the application of the counterless strategy $\sigma$ on $\mathcal{A}$. That is, $\mathcal{A}^{\sigma} = (Q, (Q, \emptyset), \delta, P^{\sigma})$ where $P^{\sigma}_q(q,i,r) = P_q(q,i,r)$ for every stochastic state $q$ of $\mathcal{A}$, while for every non-deterministic state $q$ of $\mathcal{A}$ we have that $P^{\sigma}_q$ is a Dirac distribution that gives probability 1 to the transition selected by $\sigma(q)$.

Theorem 8.8.6 of [21] now guarantees that the set $D$ is exactly the set of all recurrent states in $\mathcal{A}^{\sigma}$. In particular, in $\mathcal{A}^{\sigma}$ there is no transition leaving $D$.

Now, let us recall a fundamental result from the theory of linear programming, the Complementary slackness theorem. In essence, this theorem states that whenever we have a pair of solutions $u$ and $v$ of the primal and dual linear program, respectively, then the following equivalence holds: the $j$-th component of $v$ is positive iff $u$ satisfies the $j$-th inequality of the primal linear program as an equality. We can apply this to our pair of solutions $\big(x, (z_q)_{q \in Q}\big)$ and $\big((y_q)_{q \in Q_0}, (y_{(q,i,q')})_{q \in Q_1,\,(q,i,q') \in \delta}\big)$ to obtain the following system of linear equations:

$$z_q = -x + k + z_r \quad \text{whenever } q \in Q_1 \cap D \text{ and } \sigma \text{ selects } (q,k,r),$$

$$z_q = -x + \sum_{(q,k,r) \in \delta} P_q((q,k,r)) \cdot (k + z_r) \quad \text{for all } q \in Q_0 \cap D.$$

With the help of these equations, we can easily prove the following lemma: 

Lemma 11. Under strategy $\sigma$, for any initial configuration $q(i)$ with $q \in D$ and for any history $u$ of length $i$ we have $\mathbb{E}^{\sigma}(m^{(i+1)} \mid \mathit{Run}(u)) = m^{(i)}(u)$. That is, $\{m^{(i)}\}_{i \geq 0}$ is a martingale relative to the filtration $\{\mathcal{F}_i\}_{i \geq 0}$.

Proof. The proof is the same as the proof of Lemma 22 in [8]. □ 

From results of [8] (where the termination time of one-counter Markov chains was studied) it follows that under strategy $\sigma$ the expected termination time is finite from every initial configuration of the form $q(i)$ with $q \in D$. To be more specific, we can prove the following:

Lemma 12. For every initial configuration $q(i)$ with $q \in D$ there are numbers $N \in \mathbb{N}$, $0 < a < 1$ such that for every $n \geq N$ we have $\mathcal{P}^{\sigma}_{q(i)}(T = n) \leq a^n$.

Proof. The proof is the same as the proof of Proposition 7 in [8]. □

The finiteness of termination time easily follows because

$$\mathbb{E}^{\sigma} q(i) = \sum_{k \in \mathbb{N}} k \cdot \mathcal{P}^{\sigma}_{q(i)}(T = k) \leq N + \sum_{k \geq N} k \cdot \mathcal{P}^{\sigma}_{q(i)}(T = k) \leq N + \sum_{k \in \mathbb{N}} k \cdot a^k < \infty.$$

Corollary 13. For any initial configuration $q(i)$ with $q \in D$ we have $\mathbb{E}^{\sigma} q(i) < \infty$.

Lemma 14. For any initial configuration $q(i)$ where $q \in D$ we have $\mathbb{E}^{\sigma} q(i) \leq (i + V)/|x|$.

Proof. As in the proof of part (B.1) of Proposition 5 we want to use the Optional stopping theorem to prove that $\mathbb{E}^{\sigma} m^{(0)} = \mathbb{E}^{\sigma} m^{(T)}$. We just need to verify that the assumptions of this theorem hold. We have already argued that $\{m^{(i)}\}_{i \geq 0}$ is a martingale and $T$ is a stopping time relative to the same filtration $\{\mathcal{F}_i\}_{i \geq 0}$. We have also observed that $\{m^{(i)}\}_{i \geq 0}$ has bounded differences. From the previous corollary we also know that the expectation of the stopping time $T$ is finite. Thus, the Optional stopping theorem applies and we indeed have $\mathbb{E}^{\sigma} m^{(0)} = \mathbb{E}^{\sigma} m^{(T)}$. But $\mathbb{E}^{\sigma} m^{(0)} = i + z_q$ and $\mathbb{E}^{\sigma} m^{(T)} = \mathbb{E}^{\sigma} z_{S^{(T)}} + |x| \cdot \mathbb{E}^{\sigma} q(i)$. This gives us $\mathbb{E}^{\sigma} q(i) = (i + z_q - \mathbb{E}^{\sigma} z_{S^{(T)}})/|x| \leq (i + V)/|x|$. □
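As a sanity check of the shape of this bound, consider the simplest possible case, which is our own illustration and not part of the proof: a one-counter Markov chain with a single control state, where the counter moves by $-1$, $0$, $+1$ with probabilities `down`, `stay`, `up` and the trend `up - down` is negative. With a single state one can take $z_q = 0$ and $V = 0$, so the lemma suggests the candidate $i/|x|$. The sketch below (all names hypothetical) only verifies, in exact rational arithmetic, that this candidate satisfies the one-step equation $t(i) = 1 + \mathit{down} \cdot t(i-1) + \mathit{stay} \cdot t(i) + \mathit{up} \cdot t(i+1)$ with $t(0) = 0$.

```python
from fractions import Fraction


def candidate_time(i, down, up):
    """Candidate expected termination time i / |x| for trend x = up - down < 0."""
    trend = up - down
    assert trend < 0, "the sketch assumes a negative trend"
    return Fraction(i) / abs(trend)


def satisfies_one_step(i, down, stay, up):
    """Check t(i) = 1 + down*t(i-1) + stay*t(i) + up*t(i+1) for i >= 1."""
    t = lambda j: candidate_time(j, down, up)
    return t(i) == 1 + down * t(i - 1) + stay * t(i) + up * t(i + 1)


# Hypothetical chain: decrement 1/2, no change 1/4, increment 1/4 (trend -1/4).
DOWN, STAY, UP = Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)
checks = [satisfies_one_step(i, DOWN, STAY, UP) for i in range(1, 50)]
```

The check confirms consistency with the recurrence only; that the candidate is the actual expected termination time is what the martingale argument above establishes.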

To prove part (B.2) of Proposition 5 it remains to prove the upper bound for an arbitrary initial state. Intuitively, every state outside $D$ is transient in $\mathcal{A}^{\sigma}$ and thus under $\sigma$ we must reach $D$ "quickly". Once $D$ is reached, we can apply the bound from the previous lemma.

Lemma 15. Let $q(i)$ be any initial configuration. Denote $p := \exp(-p_{\min}^{|Q|}/|Q|)$ where $p_{\min}$ is the minimal nonzero probability in $\mathcal{A}$. Then we have

$$\mathbb{E}^{\sigma} q(i) \leq \frac{i + V + 2|Q| + \frac{4}{(1-p)^2}}{|x|}.$$

Before we prove Lemma 15, we should mention that the Lemma directly implies the inequality in part (B.2) of Proposition 5. Indeed, the desired inequality holds for $U = V + 2|Q| + \frac{4}{(1-p)^2}$. The required asymptotic bound on $U$ is easy to check: we just need to recall that for every real number $x \in [0, 1]$ we have $1 - \exp(-x) \geq x/2$ and thus $1/(1-p)^2 \leq 4|Q|^2 / p_{\min}^{2|Q|}$. This also shows that $U$ is computable in time polynomial in $\|\mathcal{A}\|$.
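The two inequalities used in this paragraph are easy to check numerically. The sketch below is only an illustration with hypothetical names (`lemma15_constants`, `elementary_inequality_holds`); it evaluates $p = \exp(-p_{\min}^{|Q|}/|Q|)$ in floating point and compares $1/(1-p)^2$ with $4|Q|^2/p_{\min}^{2|Q|}$ for a few sample parameters.

```python
import math


def lemma15_constants(p_min, n_states):
    """Return (p, 1/(1-p)^2, 4|Q|^2 / p_min^(2|Q|)) for the given parameters."""
    p = math.exp(-(p_min ** n_states) / n_states)
    lhs = 1.0 / (1.0 - p) ** 2
    rhs = 4.0 * n_states ** 2 / p_min ** (2 * n_states)
    return p, lhs, rhs


def elementary_inequality_holds(samples=1000):
    """Check 1 - exp(-x) >= x/2 on a grid of points in [0, 1]."""
    return all(
        1.0 - math.exp(-x) >= x / 2.0
        for x in (j / samples for j in range(samples + 1))
    )


# Sample parameters (hypothetical): pairs (p_min, |Q|).
results = [lemma15_constants(pm, n) for pm, n in [(0.5, 3), (0.1, 2), (1.0, 1)]]
```

The bound follows by substituting $x = p_{\min}^{|Q|}/|Q| \in [0,1]$ into $1 - \exp(-x) \geq x/2$, which gives $1 - p \geq p_{\min}^{|Q|}/(2|Q|)$.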

Proof (Proof of Lemma 15). We can write

$$\mathbb{E}^{\sigma} q(i) = \mathbb{E}^{\sigma}(T_1 + T_2),$$

where $T_1(\omega) = k$ iff $k$ is the first point in time when either $C^{(k)}(\omega) = 0$ or $S^{(k)}(\omega) \in D$; and where $T_2$ returns the termination time measured from the first time when $D$ was hit (formally we have $T_2(\omega) = -T_1(\omega) + T(\omega)$ if $T_1(\omega) < \infty$ and $T_2(\omega) = 0$ otherwise). We will bound the expectations of $T_1$ and $T_2$ separately.

Let us start with $T_1$. Any run $\omega$ with $T_1(\omega) \geq k$ must either terminate before hitting $D$ but after at least $k$ steps, or it has to hit $D$ after at least $k$ steps. In both cases $\omega$ has to evade $D$ for at least $k - 1$ steps. From e.g. Lemma 23 of [8] we know that the probability of evading $D$ for at least $k - 1 \geq |Q| - 1$ steps is at most $2p^k$. We get


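$$\mathbb{E}^{\sigma} T_1 = \sum_{k=1}^{\infty} k \cdot \mathcal{P}^{\sigma}_{q(i)}(T_1 = k) = \sum_{k=1}^{\infty} \mathcal{P}^{\sigma}_{q(i)}(T_1 \geq k) \leq |Q| + \sum_{k=|Q|+1}^{\infty} \mathcal{P}^{\sigma}_{q(i)}(T_1 \geq k) \leq |Q| + \sum_{k=|Q|+1}^{\infty} 2p^k \leq |Q| + \sum_{k=0}^{\infty} 2p^k = |Q| + \frac{2}{1-p}. \tag{1}$$

Let us now concentrate on $T_2$. For any $l > 0$ we denote by $D_l$ the set of all runs that terminate after hitting $D$ and have counter value $l$ when they hit $D$ for the first time. (Formally, $\omega \in D_l$ iff $T_1(\omega) < \infty$, $S^{(T_1)}(\omega) \in D$ and $C^{(T_1)}(\omega) = l$.) We also denote by $D_0$ the set of all runs that reach a configuration with zero counter before or simultaneously with hitting $D$ for the first time. Then we have

$$\mathbb{E}^{\sigma} T_2 = \sum_{l=0}^{\infty} \mathbb{E}^{\sigma}(T_2 \mid D_l) \cdot \mathcal{P}^{\sigma}_{q(i)}(D_l). \tag{2}$$

Note that by Lemma 14 we have for every $l \in \mathbb{N}$

$$\mathbb{E}^{\sigma}(T_2 \mid D_l) \leq \frac{l + V}{|x|}.$$

Particularly for every $l \leq i + |Q|$ we have

$$\mathbb{E}^{\sigma}(T_2 \mid D_l) \leq \frac{i + V}{|x|} + \frac{|Q|}{|x|}. \tag{3}$$

On the other hand, for $l > i + |Q|$ we have $\mathcal{P}^{\sigma}_{q(i)}(D_l) \leq 2p^{l-i}$, since no run in $D_l$ can hit $D$ in less than $l - i$ steps. Moreover, for $l > i + |Q|$ we can write

$$\mathbb{E}^{\sigma}(T_2 \mid D_l) \leq \frac{i + (l - i) + V}{|x|} = \frac{i + V}{|x|} + \frac{l - i}{|x|}. \tag{4}$$

Plugging (3) and (4) into (2) we can compute

$$\mathbb{E}^{\sigma} T_2 \leq \frac{i + V}{|x|} + \frac{|Q|}{|x|} + \sum_{l > i + |Q|} \frac{2 \cdot (l - i) \cdot p^{l-i}}{|x|} \leq \frac{i + V}{|x|} + \frac{|Q|}{|x|} + \frac{2}{|x| \cdot (1-p)^2}. \tag{5}$$

Putting (1) and (5) together we obtain

$$\mathbb{E}^{\sigma} q(i) \leq |Q| + \frac{2}{1-p} + \frac{i + V}{|x|} + \frac{|Q|}{|x|} + \frac{2}{|x| \cdot (1-p)^2} \leq \frac{i + V + 2|Q| + \frac{4}{(1-p)^2}}{|x|},$$

where the last step uses $|x| \leq 1$ and $0 < p < 1$.

A.2 Proof of Proposition 8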



First, consider the membership problem for $Q_{<0}$. A decomposition of $Q$ into maximal end components can be computed in polynomial time using standard algorithms (see, e.g. [21]). By solving the system $\mathcal{L}$ for individual MECs, we obtain the trends $x_C$ that in turn determine the set $H$. Finally, solving, in polynomial time, the qualitative reachability of $H$ for every state $q$ we obtain the set $Q_{<0}$.
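The last step, qualitative (probability-one) reachability in a finite MDP, can be sketched as a nested fixed-point computation. This is one textbook way to do it and is offered only as an illustration under assumptions of our own: the encoding (`succ`, `stochastic`) and the name `almost_sure_reach` are hypothetical, and every state is assumed to have at least one successor. Only the graph structure matters, since all edges of a stochastic state have positive probability.

```python
def almost_sure_reach(states, stochastic, succ, target):
    """States from which `target` is reached with probability 1 under some strategy.

    states: iterable of states; stochastic: set of stochastic states (all
    others are controlled); succ[s]: list of successors of s.
    """
    good = set(states)
    while True:
        # Inner fixed point: states of `good` that can reach the target with
        # positive probability without ever risking to leave `good`.
        reach = set(target) & good
        changed = True
        while changed:
            changed = False
            for s in good - reach:
                if s in stochastic:
                    ok = all(t in good for t in succ[s]) and any(
                        t in reach for t in succ[s]
                    )
                else:
                    ok = any(t in reach for t in succ[s])
                if ok:
                    reach.add(s)
                    changed = True
        if reach == good:
            return good
        good = reach


# Toy instance (hypothetical): state 3 is the target, state 2 is a sink,
# state 4 falls into the sink with positive probability.
toy_succ = {0: [1, 2], 1: [3, 1], 2: [2], 3: [3], 4: [3, 2]}
toy_result = almost_sure_reach(
    states=range(5), stochastic={1, 2, 3, 4}, succ=toy_succ, target={3}
)
```

Each round of the outer loop removes at least one state, so the procedure runs in time polynomial in the size of the graph, in line with the polynomial bound claimed above.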

It remains to prove that $Q_{fin} = Q_{<0}$. We prove both inclusions separately.

'$\supseteq$': Assume that $p \in Q_{<0}$. First, observe that if $p$ belongs to a MEC $C$ satisfying $x_C < 0$ then, by Proposition 5, there is a counterless strategy which stays in $C$ and terminates in finite expected time. In particular, $\mathrm{Val}(p(\ell))$ is finite and depends linearly on $\ell$.

Assume that a strategy $\sigma$ almost surely reaches $H$ from $p$ in the finite-state MDP underlying $\mathcal{A}$. As almost sure reachability is solved using memoryless strategies in finite MDPs, we may assume that $\sigma$ is memoryless. Denote by $\mathcal{H}$ the set of all configurations of the form $q(\ell)$ where either $q \in H$, or $\ell = 0$. The strategy $\sigma$ induces a counterless strategy $\sigma'$ in the OC-MDP $\mathcal{A}$ which reaches $\mathcal{H}$ with probability one. Moreover, using $\sigma'$, $\mathcal{H}$ is reachable with a positive probability from any configuration in at most $|Q|$ steps. This means that the expected time to reach $\mathcal{H}$ is finite and the probability of reaching a configuration of $\mathcal{H}$ with counter value
at most I before any other configuration of 7f is bounded by ^ for suitable constants 
$c, d > 0$. As $\mathrm{Val}(q(\ell))$ depends linearly on $\ell$ for every $q \in H$, we obtain that the expected termination time for $p(k)$ is finite.

'$\subseteq$': We proceed by contradiction. Assume that $Q_{fin} \setminus Q_{<0} \neq \emptyset$. The following Lemma formalizes the crucial idea.

Lemma 16. Assuming $Q_{fin} \setminus Q_{<0} \neq \emptyset$, there is a MEC $C$ satisfying $C \subseteq Q_{fin} \setminus Q_{<0}$ for which the following holds: if $s \leadsto t$ where $s \in C$ and $t \notin C$, then $t \in Q \setminus Q_{fin}$.
Proof. First, we prove that if $Q_{fin} \setminus Q_{<0} \neq \emptyset$, then it contains at least one MEC. Assume, to the contrary, that all MECs contained in $Q_{fin}$ are also contained in $Q_{<0}$. We claim that then $Q_{fin} \subseteq Q_{<0}$. Indeed, consider $p \in Q_{fin} \setminus Q_{<0}$. Note that starting in $p$, almost every run eventually reaches a MEC no matter what strategy is used. Moreover, there is a strategy which almost surely stays within $Q_{fin}$ forever starting in $p$. Using such a strategy, almost all runs initiated in $p$ reach MECs contained in $Q_{fin}$ and hence also in $Q_{<0}$. Thus, by definition of $Q_{<0}$, we have $p \in Q_{<0}$ which contradicts $p \in Q_{fin} \setminus Q_{<0}$.

If there is a MEC $C \subseteq Q_{fin} \setminus Q_{<0}$ such that no transition $s \leadsto t$ satisfies $s \in C$ and $t \notin C$, then we are done. Assume, to obtain a contradiction, that for every MEC $C \subseteq Q_{fin} \setminus Q_{<0}$ there is $s_C \leadsto t_C$ such that $s_C \in C$ but $t_C \in Q_{fin} \setminus C$. Then for every $t_C$ there is a strategy which stays within $Q_{fin}$. Let us consider a strategy $\pi$ that does the following:

- in all states of every MEC $C$ satisfying $C \subseteq Q_{fin} \setminus Q_{<0}$, the strategy $\pi$ strives to reach $s_C$ with probability one

- in each $s_C$, the strategy $\pi$ takes the transition $s_C \leadsto t_C$ with probability one

- in states of $Q_{fin}$ that do not belong to any MEC, the strategy $\pi$ stays in $Q_{fin}$.

Note that we may safely assume that $\pi$ is memoryless. Consider the Markov chain $M^{\pi}$ induced by $\pi$ on states of $Q_{fin}$. There are two possibilities. First, every bottom strongly connected component (BSCC) of $M^{\pi}$ contains a state of $Q_{<0}$. Then $Q_{<0}$ is reachable with probability one using $\pi$ from states of $Q_{fin} \setminus Q_{<0}$, a contradiction with the definition of $Q_{<0}$. Second, assume that there is at least one BSCC of $M^{\pi}$ which does not contain states of $Q_{<0}$. Then the BSCC contains only states of $Q_{fin} \setminus Q_{<0}$. Thus, by definition of $\pi$, the BSCC must contain at least two MECs, a contradiction with the definition of MEC. □

Now let $\ell$ be a counter value such that for every $q \in Q \setminus Q_{fin}$ we have that $\mathrm{Val}(q(\ell)) = \infty$. Let $\sigma$ be a strategy and consider $p(\ell + |Q|)$ where $p \in C$. We prove that $p$ cannot belong to $Q_{fin}$, which contradicts $C \subseteq Q_{fin} \setminus Q_{<0}$.

There are two cases. First, assume that using $\sigma$, a configuration of the form $q(k)$, where $k \geq \ell$ and $q \in Q \setminus C$, is reachable via configurations with counter values at least $\ell$ whose control states belong to $C$. Then by Lemma 16, $q \in Q \setminus Q_{fin}$ and thus the expected termination time from $q(\ell)$ is infinite. It follows that the termination time from $p(\ell + |Q|)$ using $\sigma$ is infinite as well. Second, assume that there is no such path, i.e. that the only way to leave $C$ from $p(\ell + |Q|)$ using $\sigma$ is to decrease the counter value below $\ell$. But then the expected termination time from $p(\ell + |Q|)$ using $\sigma$ is at least as large as $\mathrm{Val}(p(|Q|))$ in $\mathcal{A}_C$, which is infinite by Proposition 5 due to $x_C \geq 0$. In both cases we obtain that $p \notin Q_{fin}$, a contradiction.

Note that the number $\ell$ mentioned above can be bounded from above by $|Q|$ by Lemma 7.



A.3 Proof of Proposition 9 

First we introduce some notation: for any run $\omega$ we denote by $\inf(\omega)$ the set of states that are visited infinitely often by $\omega$. For any MEC $C$ we denote $M_C = \{\omega \mid \inf(\omega) \subseteq C\}$. It is well known that under an arbitrary strategy $\pi$ we have $\mathcal{P}^{\pi}\big(\bigcup_{C \in MEC(\mathcal{A})} M_C\big) = 1$, i.e. that $\inf(\omega)$ is almost surely contained in some MEC.

For any state $q$ denote by $\Sigma_q^{<0}$ the set of all strategies $\sigma$ with the property that $\mathcal{P}^{\sigma}_{q}(\{\omega \mid \omega \in M_C,\ x_C \geq 0\}) = 0$. Note that by Proposition 8 we have $\Sigma_q^{<0} \neq \emptyset$ for all $q \in Q_{fin}$.

Let us start with part (A) of Proposition 9. We want to describe a counterless strategy $\sigma$ that terminates "quickly" from any configuration $q(j)$ with $q \in Q_{fin}$. Part (B.2) of Proposition 5 gives us for every MEC $C$ a counterless strategy $\sigma_C$ such that for any initial configuration $q(i)$ with $q \in C$ we have $\mathbb{E}^{\sigma_C} q(i) \leq (i + U_C)/|x_C|$, for some number $U_C$. The main idea behind the construction of $\sigma$ is to stitch these strategies together in an appropriate way.

We argue that the following should hold: First, strategy $\sigma$ should be in $\Sigma_q^{<0}$ for all states $q \in Q_{fin}$. Otherwise, the finite Markov chain induced by $\sigma$ on states of $\mathcal{A}$ would have some bottom strongly connected component (BSCC) contained in a MEC $C$ with $x_C \geq 0$. By part (A) of Proposition 5 this would mean that $\mathbb{E}^{\sigma} q(j) = \infty$ for some $j$.

Second, strategy $\sigma$ should minimize the long-run average number of steps needed to decrease the counter value by one. Note that since $x_C$ represents the minimal long-run average change in counter value in MEC $C$, the number $|x_C|^{-1}$ represents exactly the long-run average time needed to decrease the counter by 1 in $C$, provided that $x_C < 0$. Thus, strategy $\sigma$ should minimize the weighted sum $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma}_{q(i)}(M_C) \cdot |x_C|^{-1}$ for any initial configuration $q(i)$ with $q \in Q_{fin}$. Note that the objective of minimizing $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma}_{q(i)}(M_C) \cdot |x_C|^{-1}$ does not depend in any way on counter values, so it suffices to show that the sum is minimized for some (unspecified) initial counter value $i$.

More formally, for every state $q \in Q_{fin}$ there is a unique number $t_q < 0$ such that $|t_q|^{-1} = \inf_{\pi \in \Sigma_q^{<0}} \sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\pi}_{q(i)}(M_C) \cdot |x_C|^{-1}$ for all $i$. We call $t_q$ the minimal trend achievable from $q$. Our goal is to find a counterless deterministic strategy $\sigma \in \Sigma_q^{<0}$ such that $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma}_{q(i)}(M_C) \cdot |x_C|^{-1} = |t_q|^{-1}$, for every $q \in Q_{fin}$ and every $i$.

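Denote $x_Q = \max\{x_C \mid C \in MEC(\mathcal{A}),\ x_C < 0\}$. In order to compute strategy $\sigma$ and numbers $t_q$, we transform $\mathcal{A}$ into a new finite-state MDP with rewards $\mathcal{A}_R$ by "forgetting" counter changes in $\mathcal{A}$ and defining a reward function $R$ on transitions in $\mathcal{A}$ as follows:

$$R((s,k,t)) = \begin{cases} \dfrac{1}{x_C} & \text{if } s, t \in C \text{ and } x_C < 0, \\[2mm] \dfrac{x_Q^{-1} - 1}{p_{\min}^{|Q|}} & \text{otherwise.} \end{cases}$$

It is clear that $\mathcal{A}_R$ can be constructed in time polynomial in $\|\mathcal{A}\|$.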

Claim. In $\mathcal{A}_R$ the maximal average reward achievable from state $q \in Q_{fin}$ is equal to $t_q^{-1}$. Moreover, there is a memoryless deterministic strategy $\sigma_R$ in $\mathcal{A}_R$ such that for every state $q \in Q_{fin}$ we have $\sigma_R \in \Sigma_q^{<0}$ and $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma_R}_{q(i)}(M_C) \cdot |x_C|^{-1} = |t_q|^{-1}$.

Proof. The existence of an optimal memoryless deterministic strategy $\sigma_R$ for maximization of average reward follows from standard results on MDPs (see [21]). It is obvious that for any $q \in Q_{fin}$ and any strategy $\pi \in \Sigma_q^{<0}$ the average reward obtained with strategy $\pi$ in $\mathcal{A}_R$ is equal to $-\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\pi}_{q}(M_C) \cdot |x_C|^{-1}$. It thus suffices to prove that $\sigma_R \in \Sigma_q^{<0}$ for every $q \in Q_{fin}$. Denote by $M^{\sigma_R}$ the finite Markov chain induced by $\sigma_R$ on states of $\mathcal{A}$. Assume, for the sake of contradiction, that $\sigma_R \notin \Sigma_q^{<0}$ for some $q \in Q_{fin}$. Then there must be a BSCC $B$ of $M^{\sigma_R}$ reachable from $q$ that is contained in some MEC $C$ with $x_C \geq 0$. In $M^{\sigma_R}$ there must be a path of length at most $|Q|$ from $q$ to $B$, which means that under $\sigma_R$ the probability of runs that have average reward $\frac{x_Q^{-1} - 1}{p_{\min}^{|Q|}}$ is at least $p_{\min}^{|Q|}$. Since no run in $\mathcal{A}_R$ has average reward greater than $-1$, it follows that the average reward achieved from $q$ with $\sigma_R$ is at most $x_Q^{-1} - 1 - (1 - p_{\min}^{|Q|}) < x_Q^{-1}$. But this is a contradiction with $\sigma_R$ maximizing the average reward, since $\Sigma_q^{<0} \neq \emptyset$ and every strategy from $\Sigma_q^{<0}$ yields average reward at least $x_Q^{-1}$. □

Strategy $\sigma_R$ can be computed in polynomial time with standard algorithms (see, e.g., [21]). We can now construct the desired counterless strategy $\sigma$ as follows: denote by $\mathcal{A}^{\sigma_R}$ the finite Markov chain induced by $\sigma_R$ on states of $\mathcal{A}$. Note that every bottom strongly connected component $B$ of $\mathcal{A}^{\sigma_R}$ is contained in exactly one MEC $C(B)$ of $\mathcal{A}$. Strategy $\sigma$ behaves in the same way as $\sigma_R$ until some BSCC $B$ of $\mathcal{A}^{\sigma_R}$ is reached. Then $\sigma$ starts to behave as $\sigma_{C(B)}$. It is easy to see that $\sigma \in \Sigma_q^{<0}$ for all states $q \in Q_{fin}$.

Clearly, for every MEC $C$ and every initial configuration $q(i)$ with $q \in Q_{fin}$ we have $\mathcal{P}^{\sigma}_{q(i)}(M_C) = \mathcal{P}^{\sigma_R}_{q(i)}(M_C)$ and thus also $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma}_{q(i)}(M_C) \cdot |x_C|^{-1} = |t_q|^{-1}$ for every $i$. Note that the numbers $t_q$ satisfy all conditions mentioned in the initial part of Proposition 9. Moreover, we can prove the following upper bound on the expected termination time under $\sigma$:

Proposition 17. There is a number $U \in \exp(\|\mathcal{A}\|^{O(1)})$ that is computable in time polynomial in $\|\mathcal{A}\|$ such that for any initial configuration $q(i)$ with $q \in Q_{fin}$ and $i \geq |Q|$ we have

$$\mathbb{E}^{\sigma} q(i) \leq \frac{i}{|t_q|} + U.$$

Proof. The proof closely follows the proof of Lemma 15. However, there is a new obstacle in the presence of components with different trends.

Since the strategy $\sigma$ is memoryless, its application on $\mathcal{A}$ yields a finite one-counter Markov chain $\mathcal{A}^{\sigma}$. Denote by $D$ the union of its bottom strongly connected components. We can now write

$$\mathbb{E}^{\sigma} q(i) = \mathbb{E}^{\sigma}(T_1 + T_2), \tag{6}$$

where again $T_1(\omega) = k$ iff $k$ is the first point in time when $\omega$ either hits $D$ or reaches a configuration with a zero counter, and $T_2$ is the time to hit a configuration with a zero counter after hitting $D$ ($T_2$ returns zero if the run never terminates or terminates before hitting $D$).

We will bound the expectations of $T_1$ and $T_2$ separately.

The bound on $\mathbb{E}^{\sigma} T_1$ can be computed in exactly the same way as in Lemma 15. Thus we can conclude that

$$\mathbb{E}^{\sigma} T_1 \leq |Q| + \frac{2}{1-p}, \tag{7}$$

where $p = \exp(-p_{\min}^{|Q|}/|Q|)$.

Now we bound the expectation of $T_2$. Recall that for any $l > 0$ we denote by $D_l$ the set of all runs that terminate after reaching $D$ and have counter value exactly $l$ when they hit $D$ for the first time (and we denote by $D_0$ the set of all runs that terminate before hitting $D$ or hit $D$ with counter value exactly 0). Also recall that $M_C$ denotes the set of all runs $\omega$ with $\inf(\omega) \subseteq C$ and that under an arbitrary strategy $\pi$ we have $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\pi}(M_C) = 1$. Finally, denote $D_l^C = M_C \cap D_l$.

As discussed in Section 3.2, we can apply Proposition 5 to every MEC $C$ of $\mathcal{A}$ separately. Especially, by construction of $\sigma$ the following holds: for every MEC $C$ that contains some BSCC of $\mathcal{A}^{\sigma}$, Proposition 5 gives us a number $U_C \in \exp(\|\mathcal{A}\|^{O(1)})$ such that $\mathbb{E}^{\sigma} p(j) \leq (j + U_C)/|x_C|$, for every $p \in C$ and $j \geq 0$. Set

$$U' = \max\{U_C \mid C \in MEC(\mathcal{A}),\ C \text{ contains some BSCC of } \mathcal{A}^{\sigma}\}.$$

Clearly we still have $U' \in \exp(\|\mathcal{A}\|^{O(1)})$. We have

$$\mathbb{E}^{\sigma} T_2 = \sum_{C \in MEC(\mathcal{A})} \sum_{l=0}^{\infty} \mathbb{E}^{\sigma}(T_2 \mid D_l^C) \cdot \mathcal{P}^{\sigma}_{q(i)}(D_l^C). \tag{8}$$

As in the proof of Lemma 15, we can easily show that for any MEC $C$ and any $l \leq i + |Q|$ we have

$$\mathbb{E}^{\sigma}(T_2 \mid D_l^C) \leq \frac{i + |Q| + U'}{|x_C|}. \tag{9}$$

For every $C$ and every $l > i + |Q|$ we have

$$\mathbb{E}^{\sigma}(T_2 \mid D_l^C) \leq \frac{i + U'}{|x_C|} + \frac{l - i}{|x_C|} \tag{10}$$

and $\mathcal{P}^{\sigma}_{q(i)}(D_l) \leq 2p^{l-i}$ (the latter holds by Lemma 23 of [8]).

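Recall that we have denoted $x_Q = \max\{x_C \mid C \in MEC(\mathcal{A}),\ x_C < 0\}$. Putting (9) and (10) together we obtain for any fixed $C \in MEC(\mathcal{A})$

$$\sum_{l=0}^{\infty} \mathbb{E}^{\sigma}(T_2 \mid D_l^C) \cdot \mathcal{P}^{\sigma}_{q(i)}(D_l^C) \leq \sum_{l=0}^{\infty} \frac{i + |Q| + U'}{|x_C|} \cdot \mathcal{P}^{\sigma}_{q(i)}(D_l^C) + \sum_{l > i + |Q|} \frac{l - i}{|x_C|} \cdot \mathcal{P}^{\sigma}_{q(i)}(D_l^C) \leq \frac{i + |Q| + U'}{|x_C|} \cdot \mathcal{P}^{\sigma}_{q(i)}(M_C) + \sum_{l > i + |Q|} \frac{l - i}{|x_Q|} \cdot \mathcal{P}^{\sigma}_{q(i)}(D_l^C).$$

Moreover, from the definition of strategy $\sigma$ we know that $\sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma}_{q(i)}(M_C) \cdot |x_C|^{-1} = |t_q|^{-1}$. We can use this and continue from (8) as follows:

$$\mathbb{E}^{\sigma} T_2 \leq \sum_{C \in MEC(\mathcal{A})} \frac{i + |Q| + U'}{|x_C|} \cdot \mathcal{P}^{\sigma}_{q(i)}(M_C) + \sum_{l > i + |Q|} \frac{(l - i) \cdot \sum_{C \in MEC(\mathcal{A})} \mathcal{P}^{\sigma}_{q(i)}(D_l^C)}{|x_Q|} \leq \frac{i}{|t_q|} + \frac{|Q| + U'}{|x_Q|} + \sum_{l > i + |Q|} \frac{2 \cdot (l - i) \cdot p^{l-i}}{|x_Q|} \leq \frac{i}{|t_q|} + \frac{|Q| + U'}{|x_Q|} + \frac{2}{|x_Q| \cdot (1-p)^2}.$$

Combining (7) and (8) we can conclude that

$$\mathbb{E}^{\sigma} q(i) \leq \frac{i}{|t_q|} + \frac{2|Q| + U'}{|x_Q|} + \frac{4}{|x_Q| \cdot (1-p)^2}.$$

The inequality in Proposition 17 thus holds for $U = \frac{2|Q| + U'}{|x_Q|} + \frac{4}{|x_Q| \cdot (1-p)^2}$. The desired asymptotic bound is again easy to check. □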

It remains to prove part (B) of Proposition 9 (with numbers $t_q$ being the minimal trends achievable from $q$).

The following Claim shows that in order to prove Proposition 9 (B) it suffices to prove its validity for strategies in $\Sigma_q^{<0}$, because the termination value under some arbitrarily fixed strategy can be approximated up to some exponential error by the termination value under a suitable strategy from $\Sigma_q^{<0}$.

Claim. There is a number $K_1 \in \exp(\|\mathcal{A}\|^{O(1)})$ that is computable in time polynomial in $\|\mathcal{A}\|$ with the following property: for every strategy $\pi$ and any initial configuration $q(i)$ with $i \geq |Q|$ there is a strategy $\pi' \in \Sigma_q^{<0}$ such that $\mathbb{E}^{\pi} q(i) \geq \mathbb{E}^{\pi'} q(i) - K_1$.

Proof. Set $K_1 = (|Q| + U)/|x_Q|$, where $U$ is the constant from Proposition 17. Fix an arbitrary strategy $\pi$. If $\mathbb{E}^{\pi} q(i) = \infty$, then the inequality clearly holds for any strategy $\pi' \in \Sigma_q^{<0}$. Otherwise, since $i \geq |Q|$, with $\pi$ we must almost surely reach a configuration of the form $p(|Q|)$. For every such reachable configuration we must have $p \in Q_{fin}$, since otherwise we would have $\mathbb{E}^{\pi} q(i) = \infty$ by Lemma 7. Define a new strategy $\pi'$ as follows: $\pi'$ behaves in the same way as $\pi$ until a configuration with counter height $|Q|$ is reached; then it starts to behave as strategy $\sigma$ from Proposition 9. Then clearly $\pi' \in \Sigma_q^{<0}$ and by Proposition 17 the switch to strategy $\sigma$ in height $|Q|$ cannot delay the termination for more than $(|Q| + U)/|x_Q|$ steps. □

Under a strategy $\pi \in \Sigma_q^{<0}$ we never reach a state from $Q \setminus Q_{fin}$, if we start from $q$. We can thus safely remove all states from $Q \setminus Q_{fin}$, together with adjacent transitions, without influencing the behavior under strategies from $\Sigma_q^{<0}$. In the following we can without loss of generality assume that $Q = Q_{fin}$ and that all strategies are in $\Sigma_q^{<0}$, for every state $q$.

We will now finish the proof in two steps. First, we observe that there is only a small probability that the run revisits (i.e. leaves and then visits again) some MEC many times. Actually, this probability decays exponentially in the number of revisits. We call a transition $r(j) \to r'(j')$ between configurations of $\mathcal{A}$ a switch if there exists some MEC $C$ such that $|\{r, r'\} \cap C| = 1$. For any run $\omega$ we denote by $\#(\omega)$ the number of switches on $\omega$ and we set $W(\omega) = \#(\omega) + 1$. That is, the random variable $W$ counts the number of maximal time intervals in which $\omega$ either stays within a single MEC or outside any MEC.

Lemma 18. For every strategy $\pi$, every initial configuration $q(i)$ and every $k \in \mathbb{N}$

$$\mathcal{P}^{\pi}_{q(i)}(W = k) \leq 8 \cdot |Q| \cdot c^k,$$

where $c = \exp\left(-\frac{p_{\min}^{|Q|}}{|Q|}\right)$.


Proof. If $p_{\min} = 1$, i.e. there are no (truly) stochastic states, then MECs are actually strongly connected components, $W(\omega) \leq 2 \cdot |Q|$ for every run $\omega$, and the Lemma trivially holds. Otherwise, we have $p_{\min} \leq 1/2$. We can use the following:

Claim. Let $\mathcal{A}$ be an arbitrary OC-MDP and let $C$ be a MEC of $\mathcal{A}$. Further, let $q \notin C$ be any state that can be reached from $C$ with probability 1 (under some strategy). Then, under an arbitrary strategy, the probability of reaching $C$ from any initial configuration of the form $q(i)$ is at most $1 - p_{\min}^{|Q|}$.

Let $\rho$ be the strategy that maximizes the probability of reaching $C$ from $q$ in $\mathcal{M}_{\mathcal{A}}$. From standard results on MDPs we may assume that $\rho$ is memoryless. Denote by $M^{\rho}$ the finite Markov chain induced by $\rho$ on states of $\mathcal{M}_{\mathcal{A}}$. There must be at least one BSCC $B$ of $M^{\rho}$ reachable from $q$ such that in $\mathcal{M}_{\mathcal{A}}$ the probability of reaching $C$ from any state of $B$ is less than 1 under any strategy (otherwise, there would be a strategy that almost surely reaches $C$ from $q$, a contradiction with $C$ being a MEC). In particular, the sets $B$ and $C$ are disjoint. Thus, the probability of not reaching $C$ from $q$ under $\rho$ is at least as large as the probability of hitting $B$ in $M^{\rho}$. Since $\rho$ is memoryless, there is a run in $M^{\rho}$ that reaches $B$ in at most $|Q|$ steps. Thus, the probability of hitting $B$ is at least $p_{\min}^{|Q|}$.

Let us now finish the proof of the Lemma. For any MEC $C$ and any $l \in \mathbb{N}$ denote by $R_C^l$ the set of all runs that leave the MEC $C$ and then return to it at least $l$ times. The claim shows that under any strategy $\pi$ we have $\mathcal{P}^{\pi}_{q(i)}(R_C^l) \leq (1 - p_{\min}^{|Q|})^l$. Now if $W(\omega) = k$ then $\omega$ must have revisited some MEC $C$ at least $\lfloor k/|Q| - 2 \rfloor$ times, i.e. $\omega \in R_C^{\lfloor k/|Q| - 2 \rfloor}$ for some MEC $C$. Thus $\mathcal{P}^{\pi}_{q(i)}(W = k) \leq |Q| \cdot (1 - p_{\min}^{|Q|})^{\lfloor k/|Q| - 2 \rfloor}$. Denote $a = (1 - p_{\min}^{|Q|})$.

We have $\lfloor k/|Q| - 2 \rfloor \geq k/|Q| - 3$ and thus $\mathcal{P}^{\pi}_{q(i)}(W = k) \leq |Q| \cdot a^{k/|Q| - 3} = |Q| \cdot a^{k/|Q|}/a^3$. Since $p_{\min} \leq 1/2$, we have $1/a^3 \leq 8$. Moreover, from calculus we know that for any real number $x$ we have $1 - x \leq \exp(-x)$. This gives us $\mathcal{P}^{\pi}_{q(i)}(W = k) \leq 8 \cdot |Q| \cdot \exp\left(-\frac{p_{\min}^{|Q|}}{|Q|}\right)^k$ and the proof is finished. □
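The final arithmetic step of this proof, passing from $|Q| \cdot a^{k/|Q| - 3}$ to $8 \cdot |Q| \cdot c^k$, can be checked numerically. The sketch below is only an illustration with a hypothetical helper name; it assumes $p_{\min} \leq 1/2$, as in the proof, and works in floating point.

```python
import math


def lemma18_bound_chain(p_min, n_states, k):
    """Return (|Q| * a^(k/|Q| - 3), 8 * |Q| * c^k) for the given parameters.

    a = 1 - p_min^|Q| and c = exp(-p_min^|Q| / |Q|), as in the proof.
    """
    a = 1.0 - p_min ** n_states
    c = math.exp(-(p_min ** n_states) / n_states)
    middle = n_states * a ** (k / n_states - 3)
    final = 8 * n_states * c ** k
    return middle, final


# Sample grid (hypothetical parameters) on which the chain is evaluated.
grid = [
    lemma18_bound_chain(pm, n, k)
    for pm in (0.5, 0.25, 0.1)
    for n in (1, 2, 3)
    for k in range(1, 60)
]
```

The comparison rests on exactly the two facts quoted in the proof: $a \geq 1/2$ gives $1/a^3 \leq 8$, and $1 - x \leq \exp(-x)$ gives $a^{k/|Q|} \leq c^k$.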

The crucial idea behind the proof of Proposition 9 (B) is now the following: whenever the system stays either in some MEC or outside any MEC for some period of time, we may approximate its behavior (up to some constant error) using the results of Section 3.1 and standard probabilistic computations, respectively. We show that it is possible to use these approximations to approximate the behavior of the whole system. The error of this new approximation now depends on the average number of time intervals during which the run stays in some MEC or outside any MEC. The following crucial proposition formalizes this idea.

Proposition 19. There is a number $K \in \exp(\|\mathcal{A}\|^{O(1)})$ that is computable in time polynomial in $\|\mathcal{A}\|$, such that for every memoryless deterministic strategy $\pi$ and every initial configuration $q(i)$ we have

$$\mathbb{E}^{\pi} q(i) \geq \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi} W.$$

Before we present the rather technical proof of Proposition 19, let us make sure that it already implies Proposition 9.

Let $q(i)$ be any initial configuration. Fix a memoryless deterministic strategy $\pi$ that minimizes the expected termination time from $q(i)$. From Lemma 18 we have $\mathbb{E}^{\pi} W = \sum_{k=0}^{\infty} k \cdot \mathcal{P}^{\pi}_{q(i)}(W = k) \leq 8 \cdot |Q| \cdot \sum_{k=0}^{\infty} k \cdot c^k = 8 \cdot |Q| \cdot \frac{c}{(1-c)^2}$. From calculus we know that for every $0 \leq x \leq 1$ it holds that $1 - \exp(-x) \geq x/2$ and thus we have $\mathbb{E}^{\pi} W \leq \frac{32 |Q|^3}{p_{\min}^{2|Q|}}$. Denote this upper bound by $K'$. Clearly $K' \in \exp(\|\mathcal{A}\|^{O(1)})$ is computable in time polynomial in $\|\mathcal{A}\|$.

By Proposition 19 we have

$$\mathrm{Val}(q(i)) = \mathbb{E}^{\pi} q(i) \geq \frac{i}{|t_q|} - K \cdot K'.$$

Since $K \cdot K' \in \exp(\|\mathcal{A}\|^{O(1)})$, this proves Proposition 9, which is what we needed to finish the proof of Theorem 3 in the general case.
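The bound on $\mathbb{E}^{\pi} W$ combines the closed form $\sum_{k \geq 0} k c^k = c/(1-c)^2$ with the elementary inequality $1 - \exp(-x) \geq x/2$. The following sketch is an illustration only, with a hypothetical function name and a truncated sum in place of the infinite series; it evaluates the three quantities so that they can be compared for concrete parameters.

```python
import math


def interval_bounds(p_min, n_states, terms=5000):
    """Return (truncated sum of k*c^k, closed form c/(1-c)^2, K' = 32|Q|^3/p_min^(2|Q|))."""
    c = math.exp(-(p_min ** n_states) / n_states)
    partial = sum(k * c ** k for k in range(terms))
    closed = c / (1.0 - c) ** 2
    k_prime = 32 * n_states ** 3 / p_min ** (2 * n_states)
    return partial, closed, k_prime


# Sample parameters (hypothetical): (p_min, |Q|) with p_min <= 1/2.
samples = {(pm, n): interval_bounds(pm, n) for pm, n in [(0.5, 1), (0.5, 2), (0.25, 2)]}
```

For very small $p_{\min}^{|Q|}$ the default number of terms would be too small for the truncated sum to approach the closed form; the closed form itself does not depend on it.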



Proof of Proposition 19

Recall that in the following we assume $Q = Q_{fin}$.

First, we need to present some technical observations.

The following lemma is a slight generalization of part (B.1) of Proposition 5.

Lemma 20. Let $q(l)$ be any initial configuration such that $q \in C$, for some MEC $C$ of $\mathcal{A}$. Denote by $T_{\to}$ the random variable that returns the first point in time when the run either terminates or reaches a configuration of the form $r(j)$ with $r \notin C$. Then under an arbitrary deterministic strategy $\pi$ that satisfies $\mathbb{E}^{\pi} q(l) < \infty$ we have

$$\mathbb{E}^{\pi} T_{\to} \geq \frac{l - V_C - 1 - \mathbb{E}^{\pi} C^{(T_{\to})}}{|x_C|}.$$



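Proof. Fix an arbitrary initial configuration $q(l)$ and a deterministic strategy $\pi$ with $\mathbb{E}^{\pi} q(l) < \infty$. Consider the following stochastic process $\{m^{(i)}\}_{i \geq 0}$:

$$m^{(i)} = \begin{cases} C^{(i)} + z_{S^{(i)}} - i \cdot x_C & \text{if } T_{\to} \geq i \text{ and } S^{(i)} \in C, \\ C^{(i)} + 1 + z_{S^{(i-1)}} - i \cdot x_C & \text{if } T_{\to} \geq i \text{ and } S^{(i)} \notin C, \\ m^{(i-1)} & \text{otherwise.} \end{cases}$$

We claim that $\{m^{(i)}\}_{i \geq 0}$ is a submartingale relative to the filtration $\{\mathcal{F}_i\}_{i \geq 0}$. The proof is again essentially the same as the proof of Lemma 3 in [4]. First, the value of $m^{(i)}(\omega)$ clearly depends only on the finite prefix of $\omega$ of length $i$. Now let $u$ be any history of length $i$. If $C^{(j)}(u) = 0$ for some $0 \leq j \leq i$ or $S^{(j)} \notin C$ for some $0 \leq j \leq i$ (i.e. if $T_{\to}(\omega) \leq i$ for all $\omega \in \mathit{Run}(u)$), then clearly $\mathbb{E}^{\pi}(m^{(i+1)} \mid \mathit{Run}(u)) = m^{(i)}(u)$.

Otherwise we denote by $r(j)$ the last configuration on $u$ and for every possible successor $r'(j')$ of $r(j)$ we set

$$p_{r'(j')} = \begin{cases} \pi(u)(r(j) \to r'(j')) & \text{if } r \in Q_1, \\ \mathit{Prob}(r(j))(r(j) \to r'(j')) & \text{if } r \in Q_0. \end{cases}$$

Suppose that $r \in Q_1$ and that $\pi$ selects a transition to a configuration $r'(j')$ with $r' \notin C$. Then

$$\mathbb{E}^{\pi}(m^{(i+1)} \mid \mathit{Run}(u)) = \mathbb{E}^{\pi}(C^{(i+1)} + 1 + z_{S^{(i)}} - (i+1) \cdot x_C \mid \mathit{Run}(u)) = C^{(i)}(u) + \mathbb{E}^{\pi}(\underbrace{C^{(i+1)} - C^{(i)} - x_C + 1}_{\geq 0} + z_{S^{(i)}} \mid \mathit{Run}(u)) - i \cdot x_C \geq C^{(i)}(u) + z_{S^{(i)}}(u) - i \cdot x_C = m^{(i)}(u).$$

On the other hand, if $r \in Q_0$ (in which case all successor configurations $r'(j')$ must satisfy $r' \in C$) or $r \in Q_1$ and $\pi$ selects a transition that stays in $C$, then we have

$$\mathbb{E}^{\pi}(m^{(i+1)} \mid \mathit{Run}(u)) = \mathbb{E}^{\pi}(C^{(i+1)} + z_{S^{(i+1)}} - (i+1) \cdot x_C \mid \mathit{Run}(u)) = C^{(i)}(u) + \mathbb{E}^{\pi}(C^{(i+1)} - C^{(i)} - x_C + z_{S^{(i+1)}} \mid \mathit{Run}(u)) - i \cdot x_C = C^{(i)}(u) + \underbrace{\Big(-x_C + \sum_{r'(j')} p_{r'(j')} \cdot (j' - j + z_{r'})\Big)}_{\geq z_r \text{ since } (x_C, (z_q)_{q \in C}) \text{ is a solution of } \mathcal{L}} - i \cdot x_C \geq C^{(i)}(u) + z_{S^{(i)}}(u) - i \cdot x_C = m^{(i)}(u).$$

Thus, $\{m^{(i)}\}_{i \geq 0}$ is indeed a submartingale. It is easy to see that $\{m^{(i)}\}_{i \geq 0}$ has bounded differences.

Clearly, the membership of every run $\omega$ in $\{T_{\to} \leq n\}$ depends only on the finite prefix of $\omega$ of length $n$, and thus $T_{\to}$ is a stopping time relative to the filtration $\{\mathcal{F}_i\}_{i \geq 0}$. Also, for every run $\omega$ we have $T_{\to}(\omega) \leq T(\omega)$ and since we assume that $\mathbb{E}^{\pi} q(l) < \infty$, we must also have $\mathbb{E}^{\pi} T_{\to} < \infty$. Thus the Optional stopping theorem applies and we have $\mathbb{E}^{\pi} m^{(0)} \leq \mathbb{E}^{\pi} m^{(T_{\to})}$. But $\mathbb{E}^{\pi} m^{(0)} = l + z_q$ and $\mathbb{E}^{\pi} m^{(T_{\to})} \leq \mathbb{E}^{\pi} C^{(T_{\to})} + \max_{r \in C} z_r + 1 + |x_C| \cdot \mathbb{E}^{\pi} T_{\to}$. This gives us $\mathbb{E}^{\pi} T_{\to} \geq (l + z_q - \max_{r \in C} z_r - 1 - \mathbb{E}^{\pi} C^{(T_{\to})})/|x_C| \geq (l - V_C - 1 - \mathbb{E}^{\pi} C^{(T_{\to})})/|x_C|$.

□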

In the following we say that $q$ is a MEC state of $\mathcal{A}$ if it lies in some MEC of $\mathcal{A}$. Otherwise we say that $q$ is a non-MEC state.

We call a state $q'$ a transient successor of a state $q$ if both $q$ and $q'$ are non-MEC states and $q'$ is reachable from $q$ along a path that does not visit any MEC. We denote by $n_{\mathcal{A}}$ the maximal number of transient successors of any state in $\mathcal{A}$.

Lemma 21. Let $\mathcal{A}$ be an arbitrary OC-MDP and let $q$ be an arbitrary state of $\mathcal{A}$ that is not contained in any MEC. Then under an arbitrary strategy $\pi$ the probability that, when starting in $q$, we will reach some MEC of $\mathcal{A}$ in at most $n_{\mathcal{A}}$ steps, is at least $p_{\min}^{n_{\mathcal{A}}}$.

Proof. We inductively define sets $H_0, H_1, \ldots \subseteq 2^{Q}$. We set $H_0 = \{\{q\}\}$. Then, we construct $H_i$ from $H_{i-1}$ by initially setting $H_i = \emptyset$ and then performing the following operation for every set $R \in H_{i-1}$: We find a state $q_R \in R$ such that $q_R$ is not contained in any MEC of $\mathcal{A}$ and $\{s \mid q_R \to s\} \cap R = \emptyset$. If there is no such state in $R$, then we add $R$ to $H_i$. Otherwise:

- If $q_R$ is a stochastic state, then we set $R' = R \cup \{s \mid q_R \to s\}$ and add $R'$ to $H_i$.

- If $q_R$ is a non-deterministic state, then we denote $\{s \mid q_R \to s\} = \{s_1, \ldots, s_n\}$. After this, we create $n$ new sets $R_1, \ldots, R_n$ where $R_t = R \cup \{s_t\}$. Finally, we add the sets $R_1, \ldots, R_n$ to $H_i$.

For every $i$ and every $R \in H_i$, all the non-MEC states in $R$ are transient successors of $q$. Thus, $H_{n_{\mathcal{A}}} = H_{n_{\mathcal{A}}+1}$. We claim that every set $R \in H_{n_{\mathcal{A}}}$ must contain at least one MEC-state of $\mathcal{A}$. Assume, for the sake of contradiction, that there is some $R \in H_{n_{\mathcal{A}}}$ containing only non-MEC-states. Then $R$ satisfies the following: for every state $q \in R$, if $q$ is non-deterministic then there is at least one state $s \in R$ such that $q \to s$; otherwise, if $q$ is stochastic, then $\{s \mid q \to s\} \subseteq R$. This also means that the restriction of $\mathcal{A}$ to the set $R$, i.e. the tuple $\mathcal{A}_R = (R, (R \cap Q_0, R \cap Q_1), \delta \cap (R \times \{+1, 0, -1\} \times R), \{P_q\}_{q \in R \cap Q_0})$, is again an OC-MDP. As every OC-MDP, $\mathcal{A}_R$ also contains at least one MEC $E$, which must be contained in some MEC of $\mathcal{A}$. This contradicts the assumption that $R$ contains only non-MEC states.

Now let $\pi$ be an arbitrary strategy and $i \geq 0$. Denote by $R_i(\pi)$ the set of states that are, when starting in $q$, reached under $\pi$ in at most $i$ steps. From the construction of $H_i$ it follows by a straightforward induction that there is some set $R \in H_i$ such that $R \subseteq R_i(\pi)$. In particular, there is some set $R \in H_{n_{\mathcal{A}}}$ such that $R \subseteq R_{n_{\mathcal{A}}}(\pi)$. Since $R$ must contain at least one MEC-state of $\mathcal{A}$, there is some history $u$ of length at most $n_{\mathcal{A}}$ such that $u$ reaches a MEC state and $\mathbb{P}^{\pi}_{q}(\mathit{Run}(u)) > 0$. Then clearly $\mathbb{P}^{\pi}_{q}(\mathit{Run}(u)) \geq p_{\min}^{n_{\mathcal{A}}}$, and this proves the lemma. □

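Corollary 22. Let $q$ be an arbitrary state of $\mathcal{A}$. Denote by $T_M$ the random variable on runs starting in $q$ that returns the first point in time when some MEC of $\mathcal{A}$ is reached. Then for an arbitrary strategy $\pi$ and every $k \geq 1$ we have $\mathbb{P}^{\pi}_{q}(T_M \geq k) \leq 4d^{k}$, where $d = \exp(-p_{\min}^{n_{\mathcal{A}}}/n_{\mathcal{A}})$.

Proof. If $p_{\min} = 1$ then $\mathbb{P}^{\pi}_{q}(T_M > n_{\mathcal{A}}) = 0$ and thus the claim trivially holds. Otherwise we have $p_{\min} \leq 1/2$. From the previous lemma we immediately see that $\mathbb{P}^{\pi}_{q}(T_M \geq k) \leq (1 - p_{\min}^{n_{\mathcal{A}}})^{\lfloor k/n_{\mathcal{A}} \rfloor}$. We can now compute

\[
\mathbb{P}^{\pi}_{q}(T_M \geq k)
\;\leq\; \bigl(1 - p_{\min}^{n_{\mathcal{A}}}\bigr)^{\lfloor k/n_{\mathcal{A}} \rfloor}
\;\leq\; \bigl(1 - p_{\min}^{n_{\mathcal{A}}}\bigr)^{k/n_{\mathcal{A}} - 2}
\;=\; \frac{\bigl(1 - p_{\min}^{n_{\mathcal{A}}}\bigr)^{k/n_{\mathcal{A}}}}{\bigl(1 - p_{\min}^{n_{\mathcal{A}}}\bigr)^{2}}
\;\leq\; 4\bigl(1 - p_{\min}^{n_{\mathcal{A}}}\bigr)^{k/n_{\mathcal{A}}}
\;\leq\; 4d^{k}.
\]

□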
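The tail bound of Corollary 22 can be sanity-checked numerically. The following is only a sketch: it assumes $d = \exp(-p_{\min}^{n}/n)$ (as recalled in the proof of Lemma 23) and $p_{\min} \leq 1/2$, and the function name `tail_bound_holds` is hypothetical. It compares the bound $(1 - p_{\min}^{n})^{\lfloor k/n \rfloor}$ obtained from Lemma 21 against $4d^{k}$ for a range of $k$.

```python
import math


def tail_bound_holds(p_min: float, n: int, k_max: int = 500) -> bool:
    """Check (1 - p_min^n)^floor(k/n) <= 4 * d^k for all 1 <= k <= k_max,
    where d = exp(-p_min^n / n)."""
    x = p_min ** n
    d = math.exp(-x / n)
    for k in range(1, k_max + 1):
        lhs = (1.0 - x) ** (k // n)   # bound coming from Lemma 21
        rhs = 4.0 * d ** k            # bound claimed in Corollary 22
        if lhs > rhs * (1 + 1e-12):   # small tolerance for rounding
            return False
    return True
```

The check relies on $1 - x \leq e^{-x}$ and on $1/(1-x)^2 \leq 4$ for $x \leq 1/2$, which are the two facts used in the computation above.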

Let $r$ and $r'$ be two states of $\mathcal{A}$ that lie in the same MEC $C$. Then clearly $t_r = t_{r'}$. We will denote by $t_C$ the common value $t_r$ of all $r$ in $C$.

We now prove Proposition 19 for MEC-acyclic OC-MDPs. We say that an OC-MDP $\mathcal{A}$ is MEC-acyclic if there is no cycle in $\mathcal{A}$ containing states from two different MECs. Equivalently, one can say that $\mathcal{A}$ is MEC-acyclic if no run in $\mathcal{A}$ returns to some MEC once it leaves this MEC. The height of a state $q$ in a MEC-acyclic OC-MDP $\mathcal{A}$, which we denote $\mathit{height}(q)$, is the maximal number of MECs visited by any path starting in $q$. The height of a given MEC $C$ is the common height of all its states.
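The definition of height can be illustrated by a small fixed-point computation. This is a minimal sketch under stated assumptions, not part of the proof: the MEC decomposition is assumed to be given (the hypothetical map `mec_of` sends a state to its MEC identifier, or is undefined for non-MEC states), and every state is assumed to appear as a key of the hypothetical successor map `succ`.

```python
def heights(succ, mec_of):
    """Return height(q) for every state q of a MEC-acyclic graph.

    succ:   dict mapping each state to an iterable of its successors
    mec_of: dict mapping MEC states to a MEC identifier (non-MEC states absent)
    """
    n_mecs = len({m for m in mec_of.values() if m is not None})
    # A path that stays in q already visits one MEC if q is a MEC state.
    h = {q: (1 if mec_of.get(q) is not None else 0) for q in succ}
    changed = True
    while changed:
        changed = False
        for q, targets in succ.items():
            for s in targets:
                # Leaving the MEC of q adds that MEC to the count of the continuation.
                leaves_mec = mec_of.get(q) is not None and mec_of.get(q) != mec_of.get(s)
                cand = h[s] + (1 if leaves_mec else 0)
                if cand > n_mecs:
                    raise ValueError("graph is not MEC-acyclic")
                if cand > h[q]:
                    h[q] = cand
                    changed = True
    return h
```

Since in a MEC-acyclic OC-MDP no cycle leaves a MEC, the values can only increase finitely often, so the loop terminates; the `ValueError` guards against inputs that violate this assumption.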

For any OC-MDP $\mathcal{A}$ we denote $\|\mathcal{A}_{\max}\| = \max\{\|\mathcal{A}_C\| \mid C \in MEC(\mathcal{A})\}$.

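Lemma 23. Let $\mathcal{A}$ be a MEC-acyclic OC-MDP. Then there is a number $K = \exp\bigl(\|\mathcal{A}_{\max}\|^{O(1)}\bigr) \cdot O\bigl(n_{\mathcal{A}}/p_{\min}^{n_{\mathcal{A}}}\bigr)$ such that the following holds for every memoryless deterministic strategy $\pi$ and every initial configuration $q(i)$:

\[
\mathbb{E}^{\pi} T \;\geq\; \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi} W. \tag{12}
\]

Moreover, $K$ is computable in time polynomial in $\|\mathcal{A}_{\max}\| \cdot \log(p_{\min}) \cdot n_{\mathcal{A}}$ by an algorithm that takes as input the number $n_{\mathcal{A}}$ and the set of strongly connected OC-MDPs $\{\mathcal{A}_C \mid C \in MEC(\mathcal{A})\}$.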

Proof. Recall that we denote $d = \exp(-p_{\min}^{n_{\mathcal{A}}}/n_{\mathcal{A}})$ and set

\[
K = \max\left\{ \frac{4}{(1-d)^2 \cdot |x_0|},\; \frac{1 + \max_{C \in MEC(\mathcal{A})} V_C}{|x_0|} \right\}.
\]

The asymptotic upper bound on $K$ is easy to check, since the numbers $x_0$ and $V_C$ for $C \in MEC(\mathcal{A})$ are computed by solving the linear program $\mathcal{L}$ for the MECs of $\mathcal{A}$; also recall that $1/(1-d)^2$ is bounded in terms of $n_{\mathcal{A}}/p_{\min}^{n_{\mathcal{A}}}$ by a standard calculus computation. This also shows that $K$ can be computed in time polynomial in $\|\mathcal{A}_{\max}\| \cdot \log(p_{\min}) \cdot n_{\mathcal{A}}$ if we know the numbers $n_{\mathcal{A}}$ and $p_{\min}$ and the OC-MDPs $\mathcal{A}_C$ for every MEC $C$ of $\mathcal{A}$.

Note that in every OC-MDP we have $\mathbb{E}^{\pi} W < \infty$ under any strategy $\pi$ (by Lemma 18). Therefore, inequality (12) trivially holds if $\mathbb{E}^{\pi} q(i) = \infty$. From now on we will assume that $\mathbb{E}^{\pi} q(i) < \infty$. In particular, we assume that under $\pi$ a configuration with zero counter is reached almost surely from $q(i)$. We proceed by induction on $\mathit{height}(q)$. For every height we will prove the inequality separately for $q$ being a non-MEC state and a MEC state, respectively.
To start the induction, suppose that $q$ lies in a MEC $C$ of height 1. But then there are no transitions leaving $C$. In particular, we have $\mathbb{E}^{\pi} W = 1$. From part (B1) of Proposition 5 and from $K \geq (V_C + 1)/|x_0|$ we have

\[
\mathbb{E}^{\pi} q(i) \;\geq\; \frac{i - V_C - 1}{|x_C|} \;\geq\; \frac{i}{|t_q|} - K.
\]

The second inequality holds because for a state $q$ that lies in a MEC $C$ with no outgoing transitions we have $t_q = x_C$.

Suppose now that $q$ is a non-MEC-state of height $h$ and that (12) holds for all MEC-states of height at most $h$.

Denote by $F_C$ the event that the first MEC encountered on a run is $C$. Note that all MECs with $\mathbb{P}^{\pi}_{q(i)}(F_C) > 0$ have height at most $h$. Denote by $D$ the union of all MECs $C$ with $\mathbb{P}^{\pi}_{q(i)}(F_C) > 0$. Similarly to previous proofs we can write $\mathbb{E}^{\pi} q(i) = \mathbb{E}^{\pi}(T_1 + T_2)$, where $T_1$ returns the first point in time when the run hits either $D$ or a configuration with a zero counter, and $T_2$ returns the time to hit a configuration with a zero counter after hitting $D$ (or 0, if the run terminates before hitting $D$ or never hits $D$ at all). Since both these random variables are non-negative, it suffices to prove the required bound (12) for $\mathbb{E}^{\pi} T_2$.

As in previous proofs, we use the notation $D_m$ (for $m > 0$) for the set of all runs that do not terminate before reaching $D$ and at the same time reach $D$ with counter value $m$. (Also recall that we denote by $D_0$ the set of runs that terminate before or in the exact moment of reaching $D$.) Moreover, we denote by $D^C_m$ the event $F_C \cap D_m$. Finally, we denote

\[
B(l, j, C) := \frac{j}{|t_C|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid D^C_l) - 1\bigr).
\]

Clearly $\sum_{C \in MEC(\mathcal{A}),\, l \geq 0} \mathbb{P}^{\pi}_{q(i)}(D^C_l) = \sum_{C \in MEC(\mathcal{A})} \mathbb{P}^{\pi}_{q(i)}(F_C) = 1$. We have

\[
\mathbb{E}^{\pi} T_2 = \sum_{C \in MEC(\mathcal{A})} \mathbb{P}^{\pi}_{q(i)}(F_C) \cdot \mathbb{E}^{\pi}(T_2 \mid F_C). \tag{13}
\]

We can write

\[
\mathbb{P}^{\pi}_{q(i)}(F_C) \cdot \mathbb{E}^{\pi}(T_2 \mid F_C) = \sum_{l=0}^{\infty} \mathbb{E}^{\pi}(T_2 \mid D^C_l) \cdot \mathbb{P}^{\pi}_{q(i)}(D^C_l). \tag{14}
\]

By the induction hypothesis we have for every $l \geq 0$

\[
\mathbb{E}^{\pi}(T_2 \mid D^C_l) \;\geq\; \frac{l}{|t_C|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid D^C_l) - 1\bigr) = B(l, l, C). \tag{15}
\]

In particular, for every $l \geq i$ we have

\[
\mathbb{E}^{\pi}(T_2 \mid D^C_l) \;\geq\; \frac{i}{|t_C|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid D^C_l) - 1\bigr) = B(l, i, C). \tag{16}
\]

Further, if we denote $g_l = i - l$, then for $l < i$ we can write

\[
\mathbb{E}^{\pi}(T_2 \mid D^C_l) \;\geq\; \frac{i - g_l}{|t_C|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid D^C_l) - 1\bigr) = B(l, i, C) - \frac{g_l}{|t_C|}. \tag{17}
\]
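We can now plug (16) and (17) into (14) and compute

\[
\begin{aligned}
\mathbb{E}^{\pi} T_2
&\geq \sum_{C \in MEC(\mathcal{A})} \Biggl( \sum_{l=i}^{\infty} B(l, i, C) \cdot \mathbb{P}^{\pi}_{q(i)}(D^C_l) + \sum_{l=0}^{i-1} \Bigl( B(l, i, C) - \frac{g_l}{|t_C|} \Bigr) \cdot \mathbb{P}^{\pi}_{q(i)}(D^C_l) \Biggr)\\
&= \sum_{C \in MEC(\mathcal{A})} \sum_{l=0}^{\infty} \frac{i}{|t_C|} \cdot \mathbb{P}^{\pi}_{q(i)}(D^C_l)
 \;-\; K \cdot \sum_{C \in MEC(\mathcal{A}),\, l \geq 0} \bigl(\mathbb{E}^{\pi}(W \mid D^C_l) - 1\bigr) \cdot \mathbb{P}^{\pi}_{q(i)}(D^C_l)
 \;-\; \sum_{C \in MEC(\mathcal{A})} \sum_{l=0}^{i-1} \frac{g_l}{|t_C|} \cdot \mathbb{P}^{\pi}_{q(i)}(D^C_l)\\
&\geq \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi} W + K - \frac{\sum_{l=0}^{i-1} \bigl(g_l \cdot \mathbb{P}^{\pi}_{q(i)}(D_l)\bigr)}{|x_0|}.
\end{aligned}
\tag{18}
\]

From Corollary 22 we have

\[
\frac{\sum_{l=0}^{i-1} \bigl(g_l \cdot \mathbb{P}^{\pi}_{q(i)}(D_l)\bigr)}{|x_0|}
\;\leq\; \frac{4 \cdot \sum_{l=0}^{i-1} g_l \cdot d^{g_l}}{|x_0|}
\;\leq\; \frac{4 \cdot \sum_{l=1}^{\infty} l \cdot d^{l}}{|x_0|}
\;\leq\; \frac{4}{(1-d)^2 \cdot |x_0|}
\;\leq\; K,
\tag{19}
\]

since no run in $D_l$, for $l < i$, can hit $D$ in less than $g_l$ steps.

This gives us $K - \sum_{l=0}^{i-1} \bigl(g_l \cdot \mathbb{P}^{\pi}_{q(i)}(D_l)\bigr)/|x_0| \geq 0$, and together with (18) we have

\[
\mathbb{E}^{\pi} T_2 \;\geq\; \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi} W,
\]

which proves that (12) holds for $q$.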

Suppose now that $q$ lies in a MEC $C$ of height $h$ and that (12) holds for all states of height $h - 1$. In particular, inequality (12) holds for all states $q' \in Q \setminus C$ such that there is a transition from $p$ to $q'$ for some $p \in C$. We will call every such state $q'$ a $C$-gate and denote by $G(C)$ the set of all $C$-gates. From the definition of $t_q$ it follows that

\[
\frac{1}{|t_q|} \leq \frac{1}{|x_C|} \quad\text{and}\quad \frac{1}{|t_q|} \leq \frac{1}{|t_{q'}|} \quad\text{for any $C$-gate } q'.
\]

We can again express $T$ as a sum of $T_1$ and $T_2$, where $T_1$ returns the first point in time when the run visits a configuration $r(l)$ with either $r \notin C$ or $l = 0$, and $T_2$ returns the time to visit a configuration with a zero counter after leaving $C$ (or 0, if the run terminates before leaving $C$ or never leaves $C$; formally we again have $T_2(\omega) = -T_1(\omega) + T(\omega)$ if $T_1(\omega) < \infty$ and $T_2(\omega) = 0$ otherwise). From Lemma 20 we have

\[
\mathbb{E}^{\pi}(T_1) \;\geq\; \frac{i - V_C - 1 - \mathbb{E}^{\pi} C^{(T_1)}}{|x_C|} \;\geq\; \frac{i - \mathbb{E}^{\pi} C^{(T_1)}}{|t_q|} - \frac{V_C + 1}{|x_0|}. \tag{20}
\]
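Now consider $T_2$. For a state $q'$ not contained in $C$ we denote by $F^{q'}_l$ the set of all runs $\omega$ that visit configuration $q'(l)$ when they leave $C$ for the first time, i.e. $\omega \in F^{q'}_l$ iff $S^{(T_1)}(\omega) = q'$ and $C^{(T_1)}(\omega) = l$. Note that for every $q'$ such that $\mathbb{P}^{\pi}_{q(i)}(F^{q'}_l) > 0$ we must have $q' \in G(C)$. If we denote $F_l = \bigcup_{q' \in G(C)} F^{q'}_l$, then it is easy to see that $\mathbb{E}^{\pi} C^{(T_1)} = \sum_{l \in \mathbb{N}} l \cdot \mathbb{P}^{\pi}_{q(i)}(F_l)$. Finally, denote by $lv_C$ the event that the run leaves $C$ at least once (i.e. $\omega \in lv_C$ iff $\omega \in F^{q'}_l$ for some $l$ and $q'$). We have

\[
\begin{aligned}
\mathbb{E}^{\pi} T_2
&= \sum_{q' \in G(C),\, l \geq 0} \mathbb{E}^{\pi}(T_2 \mid F^{q'}_l) \cdot \mathbb{P}^{\pi}_{q(i)}(F^{q'}_l)\\
&\geq \sum_{q' \in G(C),\, l \geq 0} \Bigl( \frac{l}{|t_{q'}|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid F^{q'}_l) - 1\bigr) \Bigr) \cdot \mathbb{P}^{\pi}_{q(i)}(F^{q'}_l)\\
&\geq \frac{\sum_{l \in \mathbb{N}} l \cdot \mathbb{P}^{\pi}_{q(i)}(F_l)}{|t_q|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid lv_C) \cdot \mathbb{P}^{\pi}_{q(i)}(lv_C) - \mathbb{P}^{\pi}_{q(i)}(lv_C)\bigr)\\
&= \frac{\mathbb{E}^{\pi} C^{(T_1)}}{|t_q|} - K \cdot \bigl(\mathbb{E}^{\pi}(W \mid lv_C) \cdot \mathbb{P}^{\pi}_{q(i)}(lv_C) - \mathbb{P}^{\pi}_{q(i)}(lv_C)\bigr),
\end{aligned}
\tag{21}
\]

where the inequality on the second line follows from the induction hypothesis.

Denote by $\overline{lv_C}$ the complement of $lv_C$. We trivially have $\mathbb{E}^{\pi}(W \mid \overline{lv_C}) \geq 1$. Putting (20) and (21) together we obtain

\[
\begin{aligned}
\mathbb{E}^{\pi} q(i)
&\geq \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi}(W \mid lv_C) \cdot \mathbb{P}^{\pi}_{q(i)}(lv_C) + K \cdot \mathbb{P}^{\pi}_{q(i)}(lv_C) - \frac{V_C + 1}{|x_0|}\\
&\geq \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi}(W \mid lv_C) \cdot \mathbb{P}^{\pi}_{q(i)}(lv_C) - K \cdot \bigl(1 - \mathbb{P}^{\pi}_{q(i)}(lv_C)\bigr)\\
&\geq \frac{i}{|t_q|} - K \cdot \Bigl( \mathbb{E}^{\pi}(W \mid lv_C) \cdot \mathbb{P}^{\pi}_{q(i)}(lv_C) + \mathbb{E}^{\pi}(W \mid \overline{lv_C}) \cdot \mathbb{P}^{\pi}_{q(i)}(\overline{lv_C}) \Bigr)\\
&= \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi} W.
\end{aligned}
\]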
Thus, (12) indeed holds for q. □ 
We will now finish the proof of Proposition 19 for an arbitrary OC-MDP with $Q = Q_{fin}$.

To achieve this, for an arbitrary OC-MDP $\mathcal{A}$ and any natural number $k$ we define a new MEC-acyclic OC-MDP $\mathcal{A}(k)$ of height $k + 1$; we augment the states of $\mathcal{A}$ with additional information that allows us to remember the number of visits of MECs. Once we know that we have left a MEC for the $k$-th time, we allow a switch to a new state with a counter-decreasing self-loop. To be more specific, call a transition $q \to q'$ a crossing if there exists a MEC $C$ such that $q \in C$, $q' \notin C$. Then for $\mathcal{A} = (Q, (Q_0, Q_1), \delta, P)$ we set $\mathcal{A}(k) = (Q_k, (Q^k_0, Q^k_1), \delta_k, P_k)$, where $Q_k = \{(q, l) \mid q \in Q,\ 1 \leq l \leq k\} \cup \{\bot\}$, and

\[
\begin{aligned}
\delta_k ={} & \{((q, l), i, (q', l)) \mid (q, i, q') \in \delta \text{ and } (q, i, q') \text{ is not a crossing}\}\\
& \cup \{((q, l), i, (q', l - 1)) \mid l > 1,\ (q, i, q') \in \delta \text{ is a crossing}\}\\
& \cup \{((q, 1), i, \bot) \mid \exists q' \text{ such that } (q, i, q') \in \delta \text{ is a crossing}\}\\
& \cup \{(\bot, -1, \bot)\}.
\end{aligned}
\]

The partition of states $(Q^k_0, Q^k_1)$ and the probability distribution $P_k$ are derived from $\mathcal{A}$ in the obvious way; we just specifically put $\bot \in Q_0$.
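The transition relation of the unfolded OC-MDP can be built mechanically from that of the original one. The following sketch only illustrates the definition of the unfolding under stated assumptions: transitions are triples `(q, i, q2)` with counter change `i`, the MEC decomposition is given by a hypothetical map `mec_of`, and the names `unfold` and `BOTTOM` are illustrative, not from the paper.

```python
BOTTOM = "bottom"  # stands for the fresh state with the counter-decreasing self-loop


def unfold(delta, mec_of, k):
    """Build the transition relation of the k-fold unfolding.

    delta:  set of triples (q, i, q2) with i in {-1, 0, +1}
    mec_of: dict mapping MEC states to a MEC identifier (non-MEC states absent)
    """
    def is_crossing(q, q2):
        # A crossing leaves the MEC that contains q.
        c = mec_of.get(q)
        return c is not None and mec_of.get(q2) != c

    delta_k = {(BOTTOM, -1, BOTTOM)}
    for (q, i, q2) in delta:
        for level in range(1, k + 1):
            if not is_crossing(q, q2):
                delta_k.add(((q, level), i, (q2, level)))      # stay on the same level
            elif level > 1:
                delta_k.add(((q, level), i, (q2, level - 1)))  # one crossing used up
            else:
                delta_k.add(((q, 1), i, BOTTOM))               # no crossings left
    return delta_k
```

Each non-crossing transition is copied once per level, while each crossing moves one level down and finally into the sink state, so the result has no cycle through two different MEC copies.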

Slightly abusing the notation, we denote by $t_{q,k}$ the minimal trend achievable from state $(q, k)$ in $\mathcal{A}(k)$.

For every deterministic strategy $\pi$ in $\mathcal{A}$ there is a naturally corresponding deterministic strategy $\pi(k)$ in $\mathcal{A}(k)$, formally defined as follows: for any history $H = (q_0, l_0)(j_0) \cdots (q_m, l_m)(j_m)$ in $\mathcal{A}(k)$ we denote by $q'(j')$ the configuration of $\mathcal{A}$ such that $\pi(q_0(j_0) \cdots q_m(j_m))$ selects the transition leading to configuration $q'(j')$; then we define $(\pi(k))(H)$ to select the transition leading to the configuration $c$ of $\mathcal{A}(k)$ such that

\[
c =
\begin{cases}
\bot(j') & \text{if } l_m = 1 \text{ and } q_m \to q' \text{ is a crossing},\\
(q', l_m - 1)(j') & \text{if } l_m > 1 \text{ and } q_m \to q' \text{ is a crossing},\\
(q', l_m)(j') & \text{otherwise}.
\end{cases}
\]

To differentiate between computations in $\mathcal{A}$ and $\mathcal{A}(k)$, we again slightly abuse notation and denote by $\mathbb{P}^{\pi(k)}$ and $\mathbb{E}^{\pi(k)}$ the probability and expected value, respectively, computed in $\mathcal{A}(k)$ under strategy $\pi(k)$. Note that if $\pi$ is memoryless deterministic, then $\pi(k)$ is also memoryless deterministic.

It is clear that for any strategy $\pi$ in $\mathcal{A}$ and any $k \geq 1$ we have $\mathbb{E}^{\pi} q(i) \geq \mathbb{E}^{\pi(k)} (q, k)(i)$. We can thus use Lemma 23 to show that for any memoryless deterministic strategy $\pi$ and any $k \geq 1$ we have

\[
\mathbb{E}^{\pi} q(i) \;\geq\; \frac{i}{|t_{q,k}|} - K \cdot \mathbb{E}^{\pi(k)}_{(q,k)(i)} W, \tag{22}
\]

for a suitable number $K$. Note that for every $k$ the MECs of $\mathcal{A}(k)$ are exactly copies of the MECs of $\mathcal{A}$ (with the exception of the MEC $\{\bot\}$). It is also easy to see that $n_{\mathcal{A}(k)} \leq |Q|$ for every $k$, and that $p_{\min}$ is the same in $\mathcal{A}$ and $\mathcal{A}(k)$ for every $k$. By Lemma 23 this means that $K \in \exp\bigl(\|\mathcal{A}\|^{O(1)}\bigr)$ can be chosen the same for every $k$ and that it can be computed by a polynomial-time algorithm that takes $\mathcal{A}$ as its input. (This is an important observation: we do not have to construct any MEC-acyclic OC-MDP in order to compute $K$.)

To finish the proof of Proposition 19 it suffices to show that

\[
\lim_{k \to \infty} \Bigl( \frac{i}{|t_{q,k}|} - K \cdot \mathbb{E}^{\pi(k)}_{(q,k)(i)} W \Bigr) = \frac{i}{|t_q|} - K \cdot \mathbb{E}^{\pi}_{q(i)} W.
\]

This is done in the following two lemmas.

Lemma 24. We have $\lim_{k \to \infty} \frac{1}{|t_{q,k}|} = \frac{1}{|t_q|}$.

Proof. For any $k \geq 1$ we clearly have $\frac{1}{|t_q|} \geq \frac{1}{|t_{q,k}|}$, so it suffices to prove that $\lim_{k \to \infty} \frac{1}{|t_{q,k}|} \geq \frac{1}{|t_q|}$. Fix an arbitrary $k \geq 1$.

Consider the "fast" counterless strategy $\rho_k$ from Proposition 9 that realizes the minimal trend $t_{q,k}$ in $\mathcal{A}(k)$. We define a new strategy $\rho'$ in $\mathcal{A}$ as follows: Initially, $\rho'$ behaves exactly as $\rho_k$, simply omitting the information on the current depth stored in the states of $\mathcal{A}(k)$. When the strategy $\rho_k$ prescribes to switch to state $\bot$, the strategy $\rho'$ starts to behave as the "fast" counterless strategy $\sigma$ in $\mathcal{A}$ from Proposition 9.

Denote by $\mathit{hit}_k(\bot)$ the event that a run in $\mathcal{A}(k)$ reaches state $\bot$. A simple computation, which uses the fact that, apart from $\{\bot\}$, all MECs of $\mathcal{A}(k)$ are copies of MECs of $\mathcal{A}$, reveals that


r' (M C ) C ffl w . / 1 \ 

y _m y _wm ^ „ f Jhit k (±)) • 1 . 

Zj \r r \ Zj \x r \ (qMir \ |f n | I 



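From the construction of $\mathcal{A}(k)$ it easily follows that $\mathbb{P}^{\rho_k}(\mathit{hit}_k(\bot)) \leq \mathbb{P}^{\rho'}(W \geq k)$. By Lemma 18 we have that $\mathbb{P}^{\rho'}(W \geq k) \to 0$ as $k \to \infty$. This gives us

\[
\frac{1}{|t_q|} - \lim_{k \to \infty} \frac{1}{|t_{q,k}|} \leq 0,
\]

which proves the lemma. □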
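Lemma 25. We have $\lim_{k \to \infty} \mathbb{E}^{\pi(k)}_{(q,k)(i)} W = \mathbb{E}^{\pi}_{q(i)} W$.

Proof. Fix an arbitrary $k \geq 1$. We have $\mathbb{E}^{\pi(k)}_{(q,k)(i)} W = \sum_{l \geq 1} l \cdot \mathbb{P}^{\pi(k)}_{(q,k)(i)}(W = l)$ and $\mathbb{E}^{\pi}_{q(i)} W = \sum_{l \geq 1} l \cdot \mathbb{P}^{\pi}_{q(i)}(W = l)$. From the construction of $\mathcal{A}(k)$ it easily follows that for all $l < k$ we have $\mathbb{P}^{\pi(k)}_{(q,k)(i)}(W = l) = \mathbb{P}^{\pi}_{q(i)}(W = l)$, and thus

\[
\bigl| \mathbb{E}^{\pi}_{q(i)} W - \mathbb{E}^{\pi(k)}_{(q,k)(i)} W \bigr|
\;\leq\; \sum_{l=k}^{\infty} l \cdot \bigl| \mathbb{P}^{\pi}_{q(i)}(W = l) - \mathbb{P}^{\pi(k)}_{(q,k)(i)}(W = l) \bigr|
\;\leq\; \sum_{l=k}^{\infty} l \cdot \mathbb{P}^{\pi}_{q(i)}(W = l) + \sum_{l=k}^{\infty} l \cdot \mathbb{P}^{\pi(k)}_{(q,k)(i)}(W = l).
\]

From Lemma 18 we have that $\mathbb{P}^{\pi}_{q(i)}(W = l) \leq b \cdot c^{l}$ for suitable numbers $b$ and $0 < c < 1$. Moreover, $\mathbb{P}^{\pi(k)}_{(q,k)(i)}(W = l) = 0$ for all $l \geq 2 \cdot (k + 1)$. Also, since $\mathbb{P}^{\pi(k)}_{(q,k)(i)}(W = l) = \mathbb{P}^{\pi}_{q(i)}(W = l)$ for $l < k$, we have $\mathbb{P}^{\pi(k)}_{(q,k)(i)}(W \geq k) = \sum_{l=k}^{\infty} \mathbb{P}^{\pi}_{q(i)}(W = l) \leq b \cdot \sum_{l=k}^{\infty} c^{l}$. Thus we can write

\[
\bigl| \mathbb{E}^{\pi}_{q(i)} W - \mathbb{E}^{\pi(k)}_{(q,k)(i)} W \bigr|
\;\leq\; b \cdot \sum_{l=k}^{\infty} l \cdot c^{l} + 2 \cdot (k + 1) \cdot b \cdot \sum_{l=k}^{\infty} c^{l}
\;\leq\; 3b \cdot \sum_{l=k}^{\infty} (l + 1) \cdot c^{l}.
\]

From standard results on power series we know that

\[
\lim_{k \to \infty} (k + 1) \cdot \sum_{l=k}^{\infty} c^{l} \;\leq\; \lim_{k \to \infty} \sum_{l=k}^{\infty} (l + 1) \cdot c^{l} = 0,
\]

and thus also $\lim_{k \to \infty} \bigl| \mathbb{E}^{\pi}_{q(i)} W - \mathbb{E}^{\pi(k)}_{(q,k)(i)} W \bigr| = 0$. This proves the lemma. □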
Fig. 2. The gadget for $x_2$ when $n = 2$. Shadow states are the entry points.
A.4 Proofs of Section 4 

Lemma 26. Given a propositional formula $\varphi$ in CNF, one can compute an OC-MDP $\mathcal{A}$, a configuration $p(K)$ of $\mathcal{A}$, and a number $N$ in time polynomial in $\|\varphi\|$ such that

- $N \leq |Q| \cdot K$, where $Q$ is the set of control states of $\mathcal{A}$;

- if $\varphi$ is satisfiable, then $\mathrm{Val}(p(K)) = N - 1$;

- if $\varphi$ is not satisfiable, then $\mathrm{Val}(p(K)) = N$.

Proof. Let $\varphi = C_1 \wedge \cdots \wedge C_n$ where $C_1, \ldots, C_n$ are clauses over propositional variables $x_1, \ldots, x_m$. We may safely assume that $n \geq 5$. Let $\pi_1, \ldots, \pi_m$ be the first $m$ prime numbers. For every $x_i$, where $1 \leq i \leq m$, we construct the gadget shown in Fig. 2. That is, we fix $\pi_i \cdot (n + 1)$ fresh stochastic control states $q^i_{j,\ell}$, where $1 \leq j \leq \pi_i$ and $1 \leq \ell \leq n + 1$, and connect them by transitions in the following way:

- $q^i_{1,1} \xrightarrow{-1} q^i_{1,2}$, $q^i_{1,\ell} \xrightarrow{0} q^i_{1,\ell+1}$ for all $2 \leq \ell \leq n$, $q^i_{1,n+1} \xrightarrow{0} q^i_{2,1}$;

- for all $2 \leq j \leq \pi_i$ we include the following transitions:

  • $q^i_{j,\ell} \xrightarrow{0} q^i_{j,\ell+1}$ for all $1 \leq \ell \leq n$,

  • $q^i_{j,n+1} \xrightarrow{-1} q^i_{j',1}$, where $j'$ is either $j + 1$ or $1$ depending on whether $j < \pi_i$ or not, respectively.

Since each $q^i_{j,\ell}$ has exactly one successor, all of the above transitions have probability one. Also note that the total size of the constructed gadgets is polynomial in $\|\varphi\|$ because $\sum_{i=1}^{m} \pi_i$ is $O(m^2 \log m)$ (see, e.g., [2]).

The control states of the form $q^i_{j,1}$, where $1 \leq j \leq \pi_i$, are called the entry points for $x_i$. Note that in $q^i_{1,1}$ the counter is decremented in just one transition, while in the other entry points we need $n + 1$ transitions to decrement the counter.

An important technical observation about the entry points is the following: For every $k \geq 1$ and $1 \leq i \leq m$, there is exactly one optimal entry point $q^i_{j,1}$ such that $\mathrm{Val}(q^i_{j,1}(k)) = k(n + 1) - n$, and for the other entry points $q^i_{j',1}$ we have that $\mathrm{Val}(q^i_{j',1}(k)) = k(n + 1)$. To see this, consider the (unique) $k'$ such that $1 \leq k' \leq \pi_i$ and $k = k' + c \cdot \pi_i$ for some $c \geq 0$. We put $j = 1$ if $k' = 1$, otherwise $j = \pi_i - k' + 2$. Now one can easily verify (with the help of Fig. 2) that $\mathrm{Val}(q^i_{j,1}(k)) = k(n + 1) - n$, and $\mathrm{Val}(q^i_{j',1}(k)) = k(n + 1)$ for the other entry points $q^i_{j',1}$.
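The observation about optimal entry points can be checked by simulating a single gadget, which is deterministic. The sketch below rests on an assumed reading of Fig. 2: in row 1 the counter is decremented by the first transition and the remaining transitions of the row leave it unchanged, whereas in every other row the counter is decremented only by the last transition; a run stops as soon as the counter reaches zero. The function names are hypothetical.

```python
def optimal_entry(prime: int, k: int) -> int:
    """Row index j of the optimal entry point for counter value k."""
    k_prime = (k - 1) % prime + 1          # unique k' in 1..prime with k = k' + c*prime
    return 1 if k_prime == 1 else prime - k_prime + 2


def termination_time(prime: int, n: int, j: int, k: int) -> int:
    """Number of transitions from entry point (row j, column 1) with counter k
    until the counter reaches zero."""
    row, col, counter, steps = j, 1, k, 0
    while counter > 0:
        if row == 1:
            if col == 1:
                counter -= 1               # row 1 decrements immediately
                col = 2
            elif col <= n:
                col += 1
            else:                          # col == n + 1: move on to row 2
                row, col = 2, 1
        else:
            if col <= n:
                col += 1
            else:                          # col == n + 1: decrement, go to next row
                counter -= 1
                row = row + 1 if row < prime else 1
                col = 1
        steps += 1
    return steps
```

In this model a pass through row 1 costs $n + 1$ transitions per decrement like any other row, except when the decrement in its first transition brings the counter to zero; this is exactly the saving of $n$ transitions enjoyed by the optimal entry point.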

Every $k \geq 1$ encodes a unique assignment $v_k : \{x_1, \ldots, x_m\} \to \{\mathit{true}, \mathit{false}\}$ defined as follows: For every $1 \leq i \leq m$ we put $v_k(x_i) = \mathit{true}$ iff $q^i_{1,1}$ is the optimal entry point for $k$. Also observe that for every assignment $v : \{x_1, \ldots, x_m\} \to \{\mathit{true}, \mathit{false}\}$ there is some $k \leq \prod_{i=1}^{m} \pi_i$ such that $v = v_k$.
We proceed by encoding the structure of $C_1, \ldots, C_n$. For each clause $C_\ell = y_{i_1} \vee \cdots \vee y_{i_t}$, where every $y_{i_h}$ is either $x_{i_h}$ or $\neg x_{i_h}$, we fix a fresh non-deterministic control state $c_\ell$ and add the following transitions for every $1 \leq h \leq t$:

- if $y_{i_h} = x_{i_h}$, then we add a transition $c_\ell \xrightarrow{0} q^{i_h}_{1,1}$;

- if $y_{i_h} = \neg x_{i_h}$, then we add a transition $c_\ell \xrightarrow{0} q^{i_h}_{j,1}$ for every $2 \leq j \leq \pi_{i_h}$.

Using the definition of $v_k$ and the above observation about the entry points, we immediately obtain that, for all $1 \leq \ell \leq n$ and $k \geq 1$,

- $v_k(C_\ell) = \mathit{true}$ iff $\mathrm{Val}(c_\ell(k)) = k(n+1) - n + 1$;

- $v_k(C_\ell) = \mathit{false}$ iff $\mathrm{Val}(c_\ell(k)) = k(n+1) + 1$.

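Now, we add a fresh stochastic control state $q_\varphi$ such that $q_\varphi \xrightarrow{0} c_\ell$ for every $1 \leq \ell \leq n$. The probability of each of these transitions is $1/n$. For every $k \geq 1$ we have that

- if $v_k(\varphi) = \mathit{true}$, then $\mathrm{Val}(q_\varphi(k)) = k(n+1) - n + 2$;

- if $v_k(\varphi) = \mathit{false}$, then at least one clause is false, which implies

\[
\mathrm{Val}(q_\varphi(k)) \;\geq\; \frac{n-1}{n}\bigl(k(n+1) - n + 2\bigr) + \frac{1}{n}\bigl(k(n+1) + 2\bigr) = k(n+1) - n + 3.
\]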

The construction of $\mathcal{A}$ is completed by adding a non-deterministic control state $p$ and a family of stochastic control states $d_1, \ldots, d_n$, where the transitions are defined as follows (here we need that $n \geq 5$):

- $p \xrightarrow{0} q_\varphi$, $p \xrightarrow{0} d_1$,

- $d_4 \xrightarrow{-1} d_5$, $d_n \xrightarrow{0} p$,

- $d_j \xrightarrow{0} d_{j+1}$ for all $1 \leq j < n$, $j \neq 4$.

Let $\sigma$ be a pure memoryless strategy in $\mathcal{A}$ such that

- in every configuration of the form $c_\ell(k)$, the strategy $\sigma$ selects a transition to some optimal entry point for $k$. If all transitions lead to non-optimal entry points, any of them can be selected;

- in a configuration of the form $p(k)$, the strategy $\sigma$ selects either the transition leading to $q_\varphi(k)$ or the transition leading to $d_1(k)$, depending on whether $v_k(\varphi) = \mathit{true}$ or not, respectively.

Obviously, $\sigma$ is optimal in all configurations of the form $c_\ell(k)$, and hence it is also optimal in all configurations of the form $q_\varphi(k)$. By induction on $k$, we show that $\sigma$ is optimal in $p(k)$, and $\mathrm{Val}(p(k))$ equals either $k(n+1) - n + 3$ or $k(n+1) - n + 4$, depending on whether $v_{k'}(\varphi) = \mathit{true}$ for some $1 \leq k' \leq k$ or not, respectively.

- $k = 1$. If $v_1(\varphi) = \mathit{true}$, then $\mathbb{E}^{\sigma} p(1) = 4$. Further, it cannot be that $\mathbb{E}^{\sigma'} p(1) < 4$ for any pure strategy $\sigma'$, because

  • if $\sigma'$ selects the transition from $p(1)$ to $d_1(1)$, then inevitably $\mathbb{E}^{\sigma'} p(1) = 5$;

  • if $\sigma'$ selects the transition from $p(1)$ to $q_\varphi(1)$, then $\mathbb{E}^{\sigma'} p(1)$ cannot be less than 4 because $\sigma$ plays optimally in $q_\varphi(1)$.
Fig. 3. The example gadget



  If $v_1(\varphi) = \mathit{false}$, then $\mathbb{E}^{\sigma} p(1) = 5$, and this outcome cannot be improved by playing the transition from $p(1)$ to $q_\varphi(1)$ because $\sigma$ is optimal in $q_\varphi(1)$ and $\mathbb{E}^{\sigma} q_\varphi(1) \geq 4$. Hence, $\sigma$ is optimal in $p(1)$ and $\mathrm{Val}(p(1))$ is either 4 or 5 depending on whether $v_1(\varphi) = \mathit{true}$ or not, respectively.

- Induction step. Let us consider a configuration $p(k+1)$. If $v_{k+1}(\varphi) = \mathit{true}$, then $\mathbb{E}^{\sigma} p(k+1) = (k+1)(n+1) - n + 3$. Since $\sigma$ plays optimally in $q_\varphi(k+1)$, this outcome cannot be improved by any pure strategy $\sigma'$ which selects the transition from $p(k+1)$ to $q_\varphi(k+1)$. If $\sigma'$ selects the transition from $p(k+1)$ to $d_1(k+1)$, then $p(k)$ is inevitably reached in exactly $n + 1$ transitions. By the induction hypothesis, this leads to an outcome of at least $(n+1) + k(n+1) - n + 3 = (k+1)(n+1) - n + 3$. Hence, $\sigma$ is optimal and $\mathrm{Val}(p(k+1)) = (k+1)(n+1) - n + 3$.

  If $v_{k+1}(\varphi) = \mathit{false}$, then (by applying the induction hypothesis) $\mathbb{E}^{\sigma} p(k+1)$ is equal either to $(n+1) + k(n+1) - n + 3$ or to $(n+1) + k(n+1) - n + 4$, depending on whether $v_{k'}(\varphi) = \mathit{true}$ for some $1 \leq k' \leq k$ or not, respectively. In both cases, this yields the desired outcome, which cannot be improved by using the transition from $p(k+1)$ to $q_\varphi(k+1)$, because then the outcome is inevitably at least $(k+1)(n+1) - n + 4$.

Now, it suffices to put $K = \prod_{i=1}^{m} \pi_i$ and $N = K(n+1) - n + 4$. Since $\pi_i$ is $O(i \log(i))$, the encoding size of $\mathcal{A}$ is polynomial in $\|\varphi\|$, and the length of the binary encoding of $K$ and $N$ is also polynomial in $\|\varphi\|$. □
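The reduction can be illustrated with the closed form for $\mathrm{Val}(p(k))$ established by the induction above. This is a sketch only: it evaluates the closed form rather than building the OC-MDP, it assumes clauses are given as lists of signed 1-based variable indices (a hypothetical DIMACS-like encoding), and it uses the fact that $v_k(x_i) = \mathit{true}$ iff $k \equiv 1 \pmod{\pi_i}$, which follows from the definition of the optimal entry point. All function names are illustrative.

```python
import math


def first_primes(m):
    """The first m prime numbers, by trial division."""
    primes, cand = [], 2
    while len(primes) < m:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes


def assignment(k, primes):
    """v_k: x_i is true iff the first entry point is optimal, i.e. k = 1 mod pi_i."""
    return [k % p == 1 for p in primes]


def satisfies(clauses, values):
    """Literal +i means x_i, literal -i means (not x_i), with 1-based i."""
    return all(any(values[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses)


def val_p(k, clauses, m):
    """Closed form for Val(p(k)) from the induction in the proof."""
    n = len(clauses)
    primes = first_primes(m)
    hit = any(satisfies(clauses, assignment(kk, primes)) for kk in range(1, k + 1))
    return k * (n + 1) - n + (3 if hit else 4)


def reduction_parameters(clauses, m):
    """K = product of the first m primes, N = K(n+1) - n + 4."""
    n = len(clauses)
    K = math.prod(first_primes(m))
    return K, K * (n + 1) - n + 4
```

For a satisfiable formula `val_p(K, clauses, m)` equals `N - 1`, and for an unsatisfiable one it equals `N`, matching the statement of Lemma 26. The brute-force loop over `kk` is exponential and serves only to illustrate the encoding.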

By Lemma 26, the existence of an algorithm which computes $\mathrm{Val}(p(k))$ up to an absolute error strictly less than $1/2$ in time $O(f)$ implies the existence of an algorithm for SAT and UNSAT whose time complexity is $O(f \circ p)$, where $p$ is a polynomial. The same can be said about an algorithm which computes $\mathrm{Val}(p(k))$ up to a relative error strictly less than $1/(2 \cdot |Q| \cdot k)$, where $Q$ is the set of control states of $\mathcal{A}$. Also note that stochastic states in $\mathcal{A}$ have outgoing edges whose probability is 1 or $1/n$, but it is trivial to modify the construction so that all of these probabilities are equal to $1/2$. So, Lemma 26 proves Theorem 10 for configurations of the form $q(i)$. Now we show that we can even take $i = 1$.

Let us consider the following OC-MDP $\mathcal{G}_k$: the set of control states is $\{p_0, \ldots, p_k\}$, all of these states are stochastic, and there is a transition from $p_i$ to $p_{i-1}$ and to $p_k$ for all $i \geq 1$. All transitions increment the counter by 1 and have probability $\frac{1}{2}$. The state $p_0$ is a dead-end with a self-loop. An example for $k = 4$ is given in Figure 3.

Lemma 27. With probability higher than $\frac{1}{4}$, a run initiated in $p_k(1)$ visits a configuration $p_0(i)$ where $i > 2^k$.

Proof. Notice that the probability of terminating (i.e. reaching $p_0$) in one step is less than or equal to $2^{-k}$, because in order to reach $p_0$ from $p_k$ the process has to take a sequence of $k$ transitions, as otherwise it restarts at $p_k$. Therefore, the probability that the process does not reach $p_0$ in $i$ steps is greater than or equal to $(1 - 2^{-k})^{i}$. For $i = 2^k$ we have that this value is $(1 - 2^{-k})^{2^k}$, but it is well-known that the sequence $(1 - \frac{1}{n})^{n}$ is increasing in $n$ and converges to $\frac{1}{e}$. As for $n = 2$ this expression is equal to $\frac{1}{4}$, for $k \geq 1$ we get that the probability of visiting $p_0$ with the counter value higher than $2^k$ is at least $\frac{1}{4}$. □
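For small $k$ the probability in Lemma 27 can be computed exactly. The sketch below is an illustration under the description of the gadget given above (from $p_i$, $i \geq 1$, the chain moves to $p_{i-1}$ or back to $p_k$ with probability $1/2$ each); the function name is hypothetical. It propagates the distribution over $p_1, \ldots, p_k$ for a given number of steps and returns the mass that has not yet reached $p_0$.

```python
from fractions import Fraction


def prob_not_reaching_p0(k: int, steps: int) -> Fraction:
    """Exact probability that a run started in p_k has not visited p_0
    within the given number of steps."""
    dist = [Fraction(0)] * (k + 1)   # dist[i]: mass in p_i that never visited p_0
    dist[k] = Fraction(1)
    for _ in range(steps):
        new = [Fraction(0)] * (k + 1)
        for i in range(1, k + 1):
            new[i - 1] += dist[i] / 2    # step towards p_0
            new[k] += dist[i] / 2        # restart at p_k
        new[0] = Fraction(0)             # discard the mass that reached p_0
        dist = new
    return sum(dist)
```

A run that has not reached $p_0$ within $2^k$ steps arrives there with a counter value above $2^k$, since every transition increments the counter, so the returned value lower-bounds the probability in the lemma.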

We also need the following lemma: 

Lemma 28. $\prod_{i=1}^{m} \pi_i \leq 2^{m^2}$, where $\pi_m$ is the $m$-th smallest prime number.

Proof. Of course $\pi_1 = 2$. Bertrand's postulate states that for every $k > 1$ there is at least one prime number $p$ such that $k < p < 2k$. From this we know that there is at least one prime in each of the following disjoint intervals $(2, 4), (4, 8), (8, 16), \ldots$, which gives us the estimate $\pi_i \leq 2^{i}$. Therefore, $\prod_{i=1}^{m} \pi_i \leq \prod_{i=1}^{m} 2^{i} = 2^{m(m+1)/2} \leq 2^{m^2}$ for all $m \geq 1$. □
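Both estimates used in the proof of Lemma 28 are easy to check for small $m$. The following sketch (with hypothetical function names) verifies $\pi_i \leq 2^{i}$ for each of the first $m$ primes and then the product bound $\prod_{i=1}^{m} \pi_i \leq 2^{m^2}$.

```python
def first_primes(m):
    """The first m prime numbers, by trial division."""
    primes, cand = [], 2
    while len(primes) < m:
        if all(cand % p for p in primes):
            primes.append(cand)
        cand += 1
    return primes


def primorial_bound_holds(m: int) -> bool:
    """Check pi_i <= 2^i for i = 1..m and prod(pi_i) <= 2^(m^2)."""
    product = 1
    for i, p in enumerate(first_primes(m), start=1):
        if p > 2 ** i:
            return False
        product *= p
    return product <= 2 ** (m * m)
```

Python integers are unbounded, so the comparison with $2^{m^2}$ is exact.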

With the help of Lemma 27 and Lemma 28, we can now prove the following: 

Lemma 29. Given a propositional formula $\varphi$ in CNF, one can compute an OC-MDP $\mathcal{B}$ that uses only probabilities $\frac{1}{2}$ on transitions such that being able to approximate $\mathrm{Val}(p_{m^2}(1))$ up to the absolute error $\frac{1}{8}$ or the relative error $2^{-|Q|}$, where $|Q|$ is the number of control states of $\mathcal{B}$, suffices to establish whether $\varphi$ is satisfiable or not.

Proof. Let $\varphi$ be an arbitrary CNF formula. We construct a polynomially sized OC-MDP $\mathcal{B}$ with probabilities on transitions equal to $\frac{1}{2}$, such that $\varphi$ is not satisfiable iff the optimal termination time from one of the control states and counter value 1 is equal to $(n + 2)(2^{m^2+1} - 1) - 6$, where $n$ and $m$ are the numbers of clauses and variables in $\varphi$, respectively. We will build $\mathcal{B}$ by combining the gadget $\mathcal{G}_{m^2}$ (see Fig. 3), where $m$ is the number of propositional variables in $\varphi$, with the OC-MDP $\mathcal{A}$ that we obtain from Lemma 26 for $\varphi$. We let the initial configuration of $\mathcal{B}$ be $p_{m^2}(1)$, and the initial control state $p$ of $\mathcal{A}$ replaces the control state $p_0$ in $\mathcal{G}_{m^2}$. Let $x_k$ denote the probability that $\mathcal{A}$ will be initiated at $p(k + 1)$ in $\mathcal{B}$, which is the same as saying that $\mathcal{G}_{m^2}$ executes $k$ transitions before reaching control state $p$. Of course $\sum_{k} x_k = 1$, and thanks to Lemma 27 we have

\[
\sum_{k \geq 2^{m^2}} x_k \geq \frac{1}{4}.
\]

Assume that $\varphi$ is not satisfiable. We know that the expected termination time from $p(k)$ in $\mathcal{A}$ is equal to $k(n + 1) - n + 4$ for every $k$, where $n$ is the number of clauses in $\varphi$. Therefore $\mathrm{Val}(p_{m^2}(1)) = \sum_{k} x_k (k + k(n + 1) - n + 4)$. Let us consider a Markov chain $M$ with positive rewards obtained from $\mathcal{G}_{m^2}$ by ignoring the counter completely and assigning reward $n + 2$ to each transition. Notice that the expected total reward before $M$ terminates is equal to $v := \sum_{k} x_k \cdot k(n + 2)$, so $\mathrm{Val}(p_{m^2}(1)) - v = \sum_{k} x_k (n - 4) = n - 4$. It is quite straightforward to compute $v$ to be $(n + 2)(2^{m^2+1} - 2)$, and so in the end we get that $\mathrm{Val}(p_{m^2}(1)) = (n + 2)(2^{m^2+1} - 1) - 6$.

Next, assume that $\varphi$ is satisfiable. Let $k'$ be the smallest number such that the assignment to the propositional variables corresponding to $k'$ in the proof of Lemma 26 satisfies $\varphi$. We know that $k' \leq \prod_{i=1}^{m} \pi_i$, which is $\leq 2^{m^2}$ thanks to Lemma 28. We also know that for all $k < k'$ we have $\mathrm{Val}(p(k)) = k(n + 1) - n + 4$ and for all $k \geq k'$ we have $\mathrm{Val}(p(k)) = k(n + 1) - n + 3$. Therefore in this case

\[
\begin{aligned}
\mathrm{Val}(p_{m^2}(1))
&= \sum_{k < k'} x_k \bigl(k + k(n + 1) - n + 4\bigr) + \sum_{k \geq k'} x_k \bigl(k + k(n + 1) - n + 3\bigr)\\
&= \sum_{k} x_k \bigl(k + k(n + 1) - n + 4\bigr) - \sum_{k \geq k'} x_k\\
&\leq (n + 2)(2^{m^2+1} - 1) - 6 - \sum_{k \geq 2^{m^2}} x_k\\
&\leq (n + 2)(2^{m^2+1} - 1) - 6 - \frac{1}{4},
\end{aligned}
\]

where the last step follows from Lemma 27. Notice that the number of control states in $\mathcal{B}$ is $|Q| \geq m^2 + \sum_{i=1}^{m} \pi_i (n + 1)$, so $8\bigl((n + 2)(2^{m^2+1} - 1) - 6\bigr) < 2^{|Q|}$. □